Parallel offload and memory retention in OpenMP 4.0

Hi,

I have two questions:

1) Is there any way to have memory retention with #pragma omp target device(0), similar to #pragma offload target(mic:0) in(p: length(size) align(align) alloc_if(i==0) free_if(i==reps-1))?

2) Why is parallel offload without memory retention serialized?
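
To make question 1 concrete, what I have in mind is something like a target data region that keeps the device buffer allocated across repeated transfers. A minimal sketch for a single device (untested on my setup; this is how I read the OpenMP 4.0 spec):

// Sketch: map(alloc:) creates the buffer on device 0 once for the whole
// region, and target update re-sends the host contents on every repetition,
// so only the transfer itself is paid inside the loop.
#pragma omp target data device(0) map(alloc: p[0:size])
{
    for (int i = 0; i < reps; i++) {
        // ... refresh p on the host ...
#pragma omp target update device(0) to(p[0:size])
    }
}   // the device buffer is released at the end of the target data region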

Let's take a look at the following code:

#include <stdio.h>
#include <stdlib.h>     // atoi
#include <malloc.h>     // _mm_malloc / _mm_free
#include <omp.h>

// Select the offload variant at compile time with -DTEST=1|2|3.
#if TEST==1
// Intel LEO offload_transfer with memory retention: allocate on the first rep, free on the last.
#define OFFLOAD offload_transfer target(mic:m) in(p: length(size) align(align) alloc_if(i==0) free_if(i==reps-1))
#elif TEST==2
// Intel LEO offload_transfer without retention: allocate and free on every call.
#define OFFLOAD offload_transfer target(mic:m) in(p: length(size) align(align))
#elif TEST==3
// OpenMP 4.0 target offload.
#define OFFLOAD omp target device(m) map(to:p[0:size])
#endif

int main(int argc, char** argv){

    int mics = 1;                        // number of coprocessors
    if (argc>1) mics = atoi(argv[1]);

    int reps = 3;                        // repetitions per buffer size
    if (argc>2) reps = atoi(argv[2]);

    int align = 64;                      // host buffer alignment in bytes
    if (argc>3) align = atoi(argv[3]);

    for (size_t size = 1L; size < 1L<<34; size *= 2){
        char * data[mics];
        for(int m=0; m<mics; m++)
            data[m] = (char*) _mm_malloc(size, align);
        for (int i = 0; i<reps; i++){
            // Touch the host buffers before each transfer (Cilk Plus array notation).
            for(int m=0; m<mics; m++)
                data[m][0:size] = i;
            double time = 0.0;
            double bw = 0.0;
            // One host thread per coprocessor: sum the per-device bandwidths,
            // keep the slowest transfer time.
#pragma omp parallel for reduction(+:bw) reduction(max:time)
            for(int m = 0; m < mics; m++){
                char * p = data[m];
                const double t1 = omp_get_wtime();
#pragma OFFLOAD
                { }
                time = omp_get_wtime() - t1;
                bw = (size / time) / (1L<<20);   // MB/s per device
            }
            printf("out: %6d\t%6d\t%12zu\t%9.6f\t%9.3f\n",
                    mics, i, size, time, bw);
        }
        for(int m = 0; m < mics; m++)
            _mm_free(data[m]);
    }
    return 0;
}
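
For reference, I build the three variants with the Intel compiler roughly like this (the exact flag spelling may differ between compiler versions):

icpc -openmp -DTEST=1 offload_1.cpp -o offload_retain
icpc -openmp -DTEST=2 offload_1.cpp -o offload_noretain
icpc -openmp -DTEST=3 offload_1.cpp -o offload_omp4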

I'm getting the following results for serial offload to one Intel Xeon Phi coprocessor:

The first offload with both #pragma offload and #pragma omp target is very slow, with a maximum bandwidth of only about 0.5 GB/s. This can be improved by using huge 2 MB pages (export MIC_USE_2MB_BUFFERS=0), which raises the bandwidth to 1.2 GB/s (see the next plot).

The regular offload pragmas allocate and deallocate memory on the Xeon Phi coprocessor on every offload call. The maximum bandwidth for the second and all later offload calls is 2.2 GB/s.

By using memory retention we can reach the PCIe v2 bandwidth limit, which is ~6.3 GB/s.
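
To spell out what I mean by memory retention: the TEST==1 variant allocates the device buffer only on the first repetition and frees it only on the last one, so the offloads in between pay for the transfer alone:

// Retention pattern from the TEST==1 macro: allocate on the first rep,
// reuse the device buffer in between, free it on the last rep.
for (int i = 0; i < reps; i++) {
#pragma offload_transfer target(mic:0) \
    in(p : length(size) align(align) alloc_if(i == 0) free_if(i == reps - 1))
}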

Using 2 MB pages improves the initial offload of new data for both #pragma offload and #pragma omp target:

But I think the most interesting part is parallel offload. Let's look at how the offload bandwidth scales when we use 1, 2, 3, and 4 Intel Xeon Phi coprocessors. The data transfers should run in parallel, because we use up to 4 OpenMP threads on the host system.

This result is a little confusing. Regular parallel #pragma offload and #pragma omp target without memory retention show almost the same bandwidth as serial offload to one device. Only offload with memory retention scales linearly with the number of devices.

Using a modified version of the Intel PCM tool (https://software.intel.com/en-us/articles/intel-performance-counter-monitor) and monitoring the memory of the Intel Xeon Phi coprocessors, I was able to plot a timeline of memory utilization and of the data transfer over the PCIe buses on Socket 0 and Socket 1 of the host system. 8 GB of data is transferred to each of the 4 Intel Xeon Phi coprocessors 5 times. We compare regular parallel offload with #pragma offload and #pragma omp target (top) against parallel offload with memory retention (bottom):

It is obvious from this timeline plot that with the regular parallel #pragma offload and #pragma omp target the data transfer to each Xeon Phi coprocessor is serialized. Only when we use memory retention with the alloc_if(0)/free_if(0) clauses do we observe simultaneous data transfer to all devices. Therefore, I'm wondering whether it is possible to have memory retention with the OpenMP 4.0 target pragma, and why we observe this serialization at all. Maybe there is some implementation limit in the offload runtime library? Or maybe it's a bug? Can anyone explain what is going on behind the scenes? Any help will be appreciated. Thank you!
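
In other words, what I would like to be able to write in pure OpenMP 4.0 is something along these lines (a sketch only, with one host thread per coprocessor and each buffer kept mapped on its device for the whole run):

// Sketch: one host OpenMP thread per coprocessor; each thread keeps its
// buffer mapped on its own device and only re-sends the data in the loop.
#pragma omp parallel for
for (int m = 0; m < mics; m++) {
    char *p = data[m];
#pragma omp target data device(m) map(alloc: p[0:size])
    {
        for (int i = 0; i < reps; i++) {
#pragma omp target update device(m) to(p[0:size])
        }
    }
}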

 

Attachments:
offload_1.cpp (3.17 KB)
fig1.png (116.65 KB)
fig2_3.png (119.27 KB)
fig3a.png (181.37 KB)
timeline2MB.png (135.56 KB)
