
vector/parallel optimization


I've posted my examples, based on the netlib vectors benchmark, at https://github.com/tprince/lcd

These examples demonstrate how to optimize vectorization (and, where appropriate, parallelization) for MIC (native) and for the host, with multiple host compilers, using C, C++, Cilk(tm) Plus, and Fortran.  A white paper is included.

Data regions are initialized in Fortran under OpenMP to take advantage of first-touch placement.  The OpenMP C and C++ test kernels benefit from this placement; the Cilk(tm) Plus kernels benefit much less.
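The repository does this initialization in Fortran; purely as a sketch of the pattern (the array names, element type, and loop body here are illustrative, not the repository code), the C equivalent of first-touch initialization under OpenMP looks like this:

    /* Each thread initializes the portion of the arrays it will later
       work on, so those pages end up local to that thread; the timed
       kernels then reuse the same static schedule. */
    void first_touch_init(float *a, float *b, float *c, int n)
    {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; ++i) {
            a[i] = 0.0f;
            b[i] = (float)i;
            c[i] = (float)(n - i);
        }
    }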

About Intel's implementation of pragma omp simd:

There are cases where it's necessary to choose correctly between the legacy Intel simd directive and the OpenMP 4 one: Intel doesn't apply pragma omp simd in the sense of avoiding implicit temporary arrays (although gcc does).  Unsafe uses of the legacy pragma (e.g. with an unspecified reduction) must be replaced with the OpenMP 4 equivalents, which are implemented in the Intel 15.0 compilers, even though the latter may be slower in some cases where the former appear to work.
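As a hedged illustration of the difference (the reduction kernel below is made up for this purpose, not one of the benchmark loops): the unsafe legacy form leaves the reduction unstated, while the OpenMP 4 form declares it explicitly.

    /* Legacy Intel directive with the reduction left unspecified:
       the directive asserts vector safety the compiler cannot verify. */
    float dot_legacy(const float *x, const float *y, int n)
    {
        float sum = 0.0f;
        #pragma simd
        for (int i = 0; i < n; ++i)
            sum += x[i] * y[i];
        return sum;
    }

    /* OpenMP 4 equivalent, accepted by the Intel 15.0 compilers and
       current gcc, with the reduction stated explicitly. */
    float dot_omp4(const float *x, const float *y, int n)
    {
        float sum = 0.0f;
        #pragma omp simd reduction(+:sum)
        for (int i = 0; i < n; ++i)
            sum += x[i] * y[i];
        return sum;
    }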

One case where it's necessary to use !dir$ simd under ifort but !$omp simd under gfortran has been filed on Premier in the hope of a resolution or an expert explanation.

The Intel implementation of pragma omp parallel for simd can achieve up to 3x the performance possible with threading or vectorization alone.  I include cases which optimize parallel regions containing for simd and non-simd worksharing loops, as well as single constructs.
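A hedged sketch of that structure (the loop bodies are placeholders, not the benchmark kernels): one parallel region holding a for simd loop, a non-simd worksharing loop, and a single construct, with nowait used where the following loop touches different data.

    void region_sketch(float *a, float *d, const float *b, const float *c, int n)
    {
        #pragma omp parallel
        {
            /* threaded and vectorized; nowait is safe because the
               next loop writes a different array */
            #pragma omp for simd nowait
            for (int i = 0; i < n; ++i)
                a[i] = b[i] + c[i];

            /* threaded, left to the auto-vectorizer */
            #pragma omp for
            for (int i = 0; i < n; ++i)
                d[i] = (b[i] > 0.0f) ? c[i] / b[i] : 0.0f;

            /* one thread performs a scalar bookkeeping step after the
               implied barrier of the preceding loop */
            #pragma omp single
            a[0] += d[0];
        }
    }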

Points specific to MIC: 

1.  Separate, conditionally compiled code paths for MIC are frequently useful.  In many cases this means simd directives that promote vectorization for MIC but are disabled for the host (although there are also cases where the vector code is slow on MIC but good on the host); the first sketch after this list shows the pattern.  The simd directives don't appear to justify the advertising slogan about directive-based vectorization, since there are many cases where vectorization is better without any directive, on the host, on MIC, or on both.  In a couple of cases the conditional compilation chooses the same version for MIC and AVX2.  In a few cases where vectorization shows a gain over Intel compiler non-vector code, it still doesn't match gnu compiler non-vector performance, so it is misleading to brag about vectorization except as a means of keeping MIC viable.

2.  MIC doesn't reach full vector or parallel performance at the loop lengths (1000 maximum) of the original benchmark.  In my experience, MIC offload fails when these data arrays are made large enough to take advantage of MIC.  I may post offload and MIC-native vector/parallel examples with larger data regions if there is interest; the examples aren't really interesting in offload mode, since it is inefficient.

3.  The examples where I inserted code to schedule parallel work by thread, avoiding schedule(dynamic) and improving cache locality (second sketch after this list), don't gain much on MIC, because the per-thread overhead is fairly high, due in part to the lack of MIC instruction-level support.

4.  In most cases where it is necessary to choose between simd and threaded parallelism (i.e. due to the limitations of Cilk(tm) Plus and Fortran array syntax), vectorization should be chosen.

5.  Parallel speedup of as much as 80x is possible in cases where vectorization isn't possible.
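Regarding point 1, here is a minimal sketch of the conditional-compilation pattern, using the __MIC__ macro the Intel compilers define for coprocessor builds; the loop body and the decision of which target gets the directive are illustrative only, not taken from the repository.

    void kernel(float *a, const float *b, int n)
    {
    #ifdef __MIC__
        /* directive needed to get this loop vectorized on MIC */
        #pragma omp simd
    #endif
        /* the host build is left to the auto-vectorizer */
        for (int i = 1; i < n; i += 2)
            a[i] = a[i - 1] + b[i];
    }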
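And for point 3, a sketch of scheduling the work by thread rather than with schedule(dynamic); the chunking below is the simplest possible split and is not the exact code used in the examples.

    #include <omp.h>

    /* Each thread takes one contiguous chunk, so the same thread
       touches the same cache lines on every call. */
    void chunked_update(float *a, const float *b, int n)
    {
        #pragma omp parallel
        {
            int nthr  = omp_get_num_threads();
            int tid   = omp_get_thread_num();
            int chunk = (n + nthr - 1) / nthr;       /* ceiling division */
            int lo    = tid * chunk;
            int hi    = (lo + chunk > n) ? n : lo + chunk;

            for (int i = lo; i < hi; ++i)
                a[i] += b[i];
        }
    }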

Specific to Cilk(tm) Plus:

1. __sec_implicit_index requires an (int) cast for efficiency, given that the examples fall well within INT_MAX (first sketch after this list).  MIC has poor instruction-level support for unsigned 64-bit integers, and even on the host there is a penalty for unsigned 32-bit.

2.  cilk_for rarely reaches 50% of OpenMP performance on MIC in the examples which take advantage of OpenMP nowait and cache locality, even though these examples use at most 118 threads or workers.

3. The examples are set up so that the transition from OpenMP to cilk_for isn't affected by KMP_BLOCKTIME; note that a Cilk worker can't use a hardware thread context while an idle OpenMP thread still holds it during the KMP_BLOCKTIME interval.

4. I show a case where it is worthwhile to choose different source code for a single thread and for cilk_for at large loop counts (second sketch after this list).
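For point 1, the cast in question looks like the following in array notation (a made-up initialization, compiled with the Intel compilers that implement Cilk(tm) Plus):

    /* Without the cast the implicit index is evaluated in a 64-bit
       (and possibly unsigned) type, which MIC handles poorly at the
       instruction level; the (int) cast is safe here because the loop
       counts stay well below INT_MAX. */
    void init_index(int *ix, int n)
    {
        ix[0:n] = (int)__sec_implicit_index(0);    /* ix[i] = i */
    }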
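For point 4, the shape of that dispatch is roughly the following; the cutover value and the loop body are placeholders, not the measured case.

    #include <cilk/cilk.h>

    void scaled_add(float *a, const float *b, float s, int n)
    {
        if (n < 1000) {                /* small counts: plain serial loop */
            for (int i = 0; i < n; ++i)
                a[i] += s * b[i];
        } else {                       /* large counts: worth spawning workers */
            cilk_for (int i = 0; i < n; ++i)
                a[i] += s * b[i];
        }
    }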

Specific to Fortran:

In many cases where there is an array assignment to an array which is in effect intent(inout), ifort allocates a temporary and performs a memcpy, even when there is no possible overlap.  The resolution involves combinations of switching to Fortran 77 style loops and applying the legacy simd directive.  In all but one similar case with gfortran, the Fortran 77 form alone resolves it (the remaining case needs omp simd).  Neither ifort nor gfortran handles forward/backward direction switching as a way to resolve concerns about data overlap in array assignment.

Specific to Intel 15.0 (and latest gnu compilers):

Besides implementing all of the OpenMP 4 features demonstrated in these benchmarks, these compilers optimize stride -1 accesses with shuffles, avoiding the need for any intrinsics (a few of which are shown as alternatives).  Cases where Intel C++ used to lose performance by violating parentheses appear to have been resolved.
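A loop of the kind meant here, which these compilers now vectorize with register shuffles rather than requiring intrinsics (the body is illustrative):

    void reverse_copy(float *a, const float *b, int n)
    {
        /* b is read with stride -1; recent Intel and gnu compilers
           vectorize this with permutes/shuffles automatically */
        for (int i = 0; i < n; ++i)
            a[i] = b[n - 1 - i];
    }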

............

I'm planning to post more white papers on optimization, perhaps influenced by whatever interest is expressed.

