Channel: Intel® Many Integrated Core Architecture

performance difference between AO and CAO

Hi,
I see performance differences between the AO and CAO models when calling the MKL zgemm (or dgemm) routines. In my tests, AO works as expected, but CAO shows poor performance compared to AO. For example, an AO routine calling zgemm with a matrix size of 10k takes 11.5 seconds, while a CAO routine calling the same zgemm within "#pragma offload target(mic)" takes 26.3 seconds. I can also see a difference in MIC usage between the two models, as shown in the attached capture, where 2 MPI processes are running with one MIC card each. What could be causing this difference? Could you please give me any advice?

Thanks,
Hong

### CAO code for zgemm ###

/* amic, bmic, Amic, Bmic and Cmic are assumed to be coprocessor-side buffers
 * declared elsewhere in the poster's code with the offload target attribute;
 * free_if(0) keeps them allocated on the card across calls. */
void cao_mkl_zgemm(int micid, char *transa, char *transb, int M, int N, int K, MKL_Complex16 *alpha, MKL_Complex16 *A, int lda, MKL_Complex16 *B, int ldb, MKL_Complex16 *beta, MKL_Complex16 *C, int ldc)
{
  #pragma offload target(mic: micid) \
  in(transa, transb, M, N, K, lda, ldb, ldc) \
  in(alpha[0:1] : into (amic[0:1]) align(64)) \
  in(beta[0:1] : into (bmic[0:1]) align(64)) \
  in(A[0:(M*K)] : into (Amic[0:(M*K)]) free_if(0) align(64)) \
  in(B[0:(K*N)] : into (Bmic[0:(K*N)]) free_if(0) align(64)) \
  in(C[0:(M*N)] : into (Cmic[0:(M*N)]) free_if(0) align(64)) \
  out(Cmic[0:(M*N)] : into (C[0:(M*N)]) align(64))
  {
    zgemm_(transa, transb, &M, &N, &K, amic, Amic, &lda, Bmic, &ldb, bmic, Cmic, &ldc);
  }
}
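
For reference, a minimal sketch of what the AO path amounts to (this is not the poster's actual AO routine; it assumes MKL_MIC_ENABLE=1 is set in the environment so MKL is free to offload the call automatically):

#include <mkl.h>

/* Automatic Offload: the host just calls zgemm; MKL decides internally how to
 * split the work between host and coprocessor and handles all data transfers
 * itself, so no offload pragmas or device-side buffers are needed. */
void ao_mkl_zgemm(char *transa, char *transb, int M, int N, int K,
                  MKL_Complex16 *alpha, MKL_Complex16 *A, int lda,
                  MKL_Complex16 *B, int ldb, MKL_Complex16 *beta,
                  MKL_Complex16 *C, int ldc)
{
  zgemm(transa, transb, &M, &N, &K, alpha, A, &lda, B, &ldb, beta, C, &ldc);
}

Note that the CAO version above re-sends A, B, and C on every call; OFFLOAD_REPORT can show how much of the 26.3 seconds is spent on data movement rather than compute.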


Free Online training on Parallel Programming and Optimization

Colfax is offering free Web-based workshops on Parallel Programming and Optimization for Intel® Architecture, including Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors. Each workshop includes 20 hours of Web-based instruction and up to 3 weeks of remote access to dedicated training servers for hands-on exercises. The workshops beginning on September 9 and October 13 are free to everyone thanks to Intel’s sponsorship.

The Colfax Hands-On Workshop (HOW) training series is an integral part of the Intel Modern Code Developer program, which supports developers in improving application performance through a systematic optimization methodology.

Attendees of these workshops may receive a certificate of completion, which certifies the Fundamental level of accomplishment in the Parallel Programming Track. Attending at least 6 of the 10 live broadcast sessions is required to receive the certificate.

Check out this link for more details and to register.


  • Code Modernization
  • Intel® Many Integrated Core Architecture
  • Parallel Computing
  • Threading
  • Vectorization
  • C/C++
  • Server
  • Developers
  • Professors
  • Students
  • Linux*
  • Include in RSS: 1
  • Beginner
  • Intermediate

    System board for Intel Xeon Phi S5120D

    Windows 10 support

    Dear Colleagues,

    I am in the process of getting a workstation with a 3120A Xeon Phi card. Due to the Windows 7 memory limit (max 192 GB supported), I am forced to consider Windows 8 or Windows 10. I am inclined to go with Windows 10, but I could not find anything about Windows 10 support for the coprocessor.

    So, is the Xeon Phi supported by Windows 10? If not, are there plans to provide drivers for Windows 10 in the near future?

    Thanks,

    Dragos

    OpenMP 4.0 target offload Report

    Hi,

    I am trying to put together comparison statistics for offload using:

    1) Intel compiler-assisted offload vs. 2) the OpenMP 4.0 target construct

    My question: how can I get an OpenMP 4.0 offload report (which environment variable do I need to set)? I used OFFLOAD_REPORT=2 with the Intel compiler offload directive and it worked fine, but I am getting very strange statistics with OpenMP 4.0 offload (I am using an Intel Xeon Phi as the execution platform).

    Here is the code

    COMPILER DIRECTIVE OFFLOAD:

    // Start time
            gettimeofday(&start, NULL);

            // Run SAXPY 
            #pragma offload target(mic:0) in(x) inout(y)
            {
                            #pragma omp parallel for default (none) shared(a,x,y)
                            for (i = 0; i < n; ++i){
                                    y[i] = a*x[i] + y[i];
                            }                                                        
            } // end of target data

            // end time 
            gettimeofday(&end, NULL);

    OPENMP 4.0 TARGET OFFLOAD:

    // Start time
            gettimeofday(&start, NULL);

            // Run SAXPY
            #pragma omp target data map(to:x)
            {
                    #pragma omp target map(tofrom:y)
                    {
                            #pragma omp parallel for
                            for (i = 0; i < n; ++i){
                                    y[i] = a*x[i] + y[i];
                            }
                    }
            } // end of target data
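
    For comparison, here is a hedged, self-contained sketch (not the poster's code) that times the same SAXPY with omp_get_wtime() and uses explicit array sections, which are required when x and y are heap-allocated. With the Intel compiler, the same OFFLOAD_REPORT=1/2/3 environment variable that reports "#pragma offload" traffic is also expected to cover "omp target" regions, but treat that as an assumption to verify on your compiler version.

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    int main(void)
    {
        int n = 1 << 24;
        float a = 2.0f;
        float *x = malloc(n * sizeof(float));
        float *y = malloc(n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        double t0 = omp_get_wtime();
        /* explicit array sections map the heap data to the coprocessor */
        #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
        double t1 = omp_get_wtime();

        printf("SAXPY on target: %.3f s, y[0] = %f\n", t1 - t0, y[0]);
        free(x); free(y);
        return 0;
    }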

     

    ------

    Thanks in advance. (Raju)

     

    No Cost Options for Intel Integrated Performance Primitives Library (IPP), Support Yourself, Royalty-Free

    The Intel® Integrated Performance Primitives Library (Intel® IPP), a high performance library with thousands of optimized functions for x86 and x86-64, is available for free for everyone (click here to register and download). Purchasing is only necessary if you want access to Intel® Premier Support (direct 1:1 private support from Intel), older versions of the library or access to other tools in Intel® Parallel Studio XE or Intel® System Studio. Intel continues to actively develop and support this very powerful library - and everyone can benefit from that!

    Intel® IPP is an extensive library which includes thousands of optimized functions covering frequently used fundamental algorithms including those for creating digital media, enterprise data, embedded, communications, and scientific/technical applications.  Intel IPP includes routines for Image Processing, Computer Vision, Data Compression, Signal Processing and (with an optional add-on) Cryptography. Intel IPP is available for Linux*, OS X* and Windows* under the Community Licensing program currently.

    Intel® IPP ships with the Intel® Compilers and all the other Intel® Performance Libraries in various products from Intel. It can be obtained together with tools for analysis, debugging, and tuning, plus MPI tools and the Intel® MPI Library, by acquiring Intel® Parallel Studio XE, or with Android support by acquiring Intel® System Studio. Did you know that some of these are available for free?

    Here is a guide to various ways to obtain the latest version of the Intel® Integrated Performance Primitives Library (Intel® IPP) for free without access to Intel® Premier Support (get support by posting to the Intel Integrated Performance Primitives Library forum). Anytime you want, the full suite of tools (Intel® Parallel Studio XE or Intel® System Studio) with Intel® Premier Support and access to previous library versions can be purchased worldwide.

    Who | What is Free? | Information | Where?
    Community Licenses for Everyone

    Intel® Integrated Performance Primitives (Intel® IPP - Linux*, Windows* or OS X* versions)

    Intel® Data Analytics Acceleration Library
    (Intel® DAAL - Linux*, Windows* or OS X* versions)

    Intel® Math Kernel Library (Intel® MKL - Linux* or Windows* versions)

    Intel® Threading Building Blocks
    (Intel® TBB - Linux*, Windows* or OS X* versions)

    Community Licensing for Intel® Performance Libraries – free for all, registration required, no royalties, no restrictions on company or project size, current versions of libraries, no Intel Premier Support access.

    Forums for discussion and support are open to everyone:

    Community Licensing for Intel Performance Libraries
    Evaluation Copies for Everyone

    Intel® Integrated Performance Primitives (Intel® IPP)
    along with Compilers, libraries and analysis tools (most everything!)

    Evaluation Copies – Try before you buy.

    Intel® Parallel Studio for Linux, Windows or OS X versions;

    Intel® System Studio for Android, Linux or Windows.

    Try Intel Parallel Studio (with Intel IPP) before you buy: Linux, Windows or OS X.

    Try Intel System Studio (with Intel IPP) before you buy: Android, Linux or Windows.

    Use as an Academic Researcher

    Linux, Windows or OS X versions of:

    Intel® Integrated Performance Primitives

    Intel® Data Analytics Acceleration Library

    Intel® Math Kernel Library

    Intel® Threading Building Blocks

    Intel® MPI Library (not available for OS X)

    If you will use it in conjunction with academic research at institutions of higher education.

    (Linux, Windows or OS X versions, except the Intel® MPI Library which is not supported on OS X, and Intel® MKL which is not available standalone on OS X)

    Qualify for Use as an Academic Researcher
    Student

    Intel® Integrated Performance Primitives (Intel® IPP)
    along with Compilers, libraries and analysis tools (most everything!)

    If you are a current student at a degree-granting institution.

    Intel® Parallel Studio for Linux, Windows or OS X versions;

    Intel® System Studio for Android, Linux or Windows.

    Qualify for Use as a Student
    Teacher

    Intel® Integrated Performance Primitives (Intel® IPP)
    along with Compilers, libraries and analysis tools (most everything!)

    If you will use it in a teaching curriculum.

    Intel® Parallel Studio for Linux, Windows or OS X versions;

    Intel® System Studio for Android, Linux or Windows.

    Qualify for Use as an Educator
    Use as an
    Open Source Contributor

    Intel® Integrated Performance Primitives (Intel® IPP)
    along with all of the
    Intel® Parallel Studio XE Professional Edition for Linux

    If you are a developer actively contributing to an open source project – and that is why you will use the tools.

    (Linux versions)

    Qualify for Use as an Open Source Contributor

    Free licenses for certain users have always been an important dimension of our offerings. One thing that really distinguishes Intel is that we sell excellent tools and provide second-to-none support for software developers who buy our tools. We provide multiple options, and we hope you will find exactly what you need in one of them.

     


  • Code Modernization
  • Development Tools
  • Intel® Many Integrated Core Architecture
  • Optimization
  • Parallel Computing
  • Threading
  • Vectorization
  • Intel® Cluster Ready
  • Message Passing Interface
  • OpenMP*
  • C/C++
  • Fortran
  • Developers
  • Professors
  • Students
  • Apple OS X*
  • Linux*
  • Microsoft Windows* (XP, Vista, 7)
  • Microsoft Windows* 10
  • Microsoft Windows* 8.x
  • Include in RSS: 1
  • Advanced
  • Beginner
  • Intermediate

    Compile OpenMP or MPI Fortran code for Intel Phi

    Hi everyone,

    Here is my problem:

    I have two different programs:

    • One in Fortran / MPI
    • One in Fortran / OpenMP

    And I would like to compile them in order to have them running on an Intel Xeon Phi.

    I just installed the free-version-for-academics of the Parallel Studio Cluster Edition 2016 on my server.

    Here are my questions:

    • Which compiler should I use to compile my Fortran code?
    • Where are the includes and libraries for MPI/Fortran?
    • Should I install bindings to get access to things like mpif.h, mpifort, etc.?

    I am asking because, until now, I was running and compiling my code on a supercomputer (using an openmpi.intel module) that gave me access to "mpifort". Now, on my local server with Parallel Studio Cluster Edition 2016, I do not see any trace of "mpifort". So I am wondering: should I use "ifort" with some options pointing to the libs and includes?

    Also, if there is documentation somewhere about this kind of thing, I would love to read it.

    Thanks in advance for your help.

    How to compile SSE intrinsic code in KNL

    Hello Sir or Madam,

    As we know, KNC does not support SSE or AVX; it only supports the IMCI instruction set, so SSE intrinsic code cannot be compiled for KNC. KNL, however, supports SSE through SSE4.2 and AVX through AVX-512. So here is my question: how do I compile SSE intrinsic code for KNL?

    Here is the relevant part of my code:

    #include <immintrin.h>     /* SSE/AVX intrinsics, including _mm_lddqu_si128 */

    typedef unsigned char U8;  /* assumption: U8 is an unsigned byte type */

    void foo (U8 * pInput, U8 * pOutput)
    {
          __m128i vByte15_00, vByte31_16, vByte47_32, vByte63_48;
          __m128i * pIn;
          pIn = (__m128i *) pInput;

          /* unaligned 16-byte loads covering bytes 0-63 of the input */
          vByte15_00 = _mm_lddqu_si128 (pIn++);
          vByte31_16 = _mm_lddqu_si128 (pIn++);
          vByte47_32 = _mm_lddqu_si128 (pIn++);
          vByte63_48 = _mm_lddqu_si128 (pIn++);

          /* remainder of the routine (use of pOutput) omitted in the post */
    }

    My Parallel Studio version:

    linux-zoon:~ # icc --version
    icc (ICC) 16.0.0 20150501
    Copyright (C) 1985-2015 Intel Corporation.  All rights reserved.

    When I compile the code with -mmic, it shows an error like "undefined reference to `__must_be_linked_with_icc_or_xild'".

    Could you please help me figure it out? Thanks a lot!

    Best Regards,

    Lei

     

     

     


    MIC on ubuntu 15.04

    A Brief Survey of NUMA (Non-Uniform Memory Architecture) Literature

    This document presents a list of articles on NUMA (Non-uniform Memory Architecture) that the author considers particularly useful. The document is divided into categories corresponding to the type of article being referenced. Often the referenced article could have been placed in more than one category. In this situation, the reference to the article is placed in what the author thinks is the most relevant category. These articles were obtained from the Internet and, though every attempt was made to identify useful and informative material, Intel does not provide any guarantees as to the veracity of the material. It is expected that the reader will use their own experience and knowledge to challenge and confirm the material in these references.

    Where beneficial, some comments (indented and in italics) as to the usefulness and content of an article are included.

    Contents

    INTRODUCTORY AND OVERVIEW

    FUNDAMENTAL

    HISTORICAL

    OPERATING SYSTEMS

    TOOLS

    CHARACTERIZATION AND OPTIMIZATION

    CASE STUDIES

     

    INTRODUCTORY AND OVERVIEW

    Lameter, Christoph. (August 2013). NUMA (Non-Uniform Memory Access): An Overview, ACM Queue, Vol. 11, no. 7. Retrieved on September 1st, 2015 from http://queue.acm.org/detail.cfm?id=2513149.

    Comment: Linux focused with a moderate list of references.

    Panourgias, Iakovos. (September 9th, 2011). NUMA effects on multicore, multi socket systems, MSc Thesis, University of Edinburgh. Retrieved on September 1st, 2015 from http://static.ph.ed.ac.uk/dissertations/hpc-msc/2010-2011/IakovosPanourgias.pdf.

    Comment: HPC benchmark focused; discussed from a programming perspective (vs an OS administrative); comprehensive.

    Non-uniform memory access, Wikipedia. Retrieved September 1st, 2015 from https://en.wikipedia.org/wiki/Non-uniform_memory_access.

    Comment: Good set of references.

    Manchanda, Nakul, and Karan Anand. (May 5th, 2010). "Non-Uniform Memory Access (NUMA)", Class thesis. New York University. Retrieved on September 1st, 2015 from http://cs.nyu.edu/~lerner/spring10/projects/NUMA.pdf.

    Yatendra Sharma. (February 10th, 2014). NUMA (Non-Uniform Memory Access): An Overview, Blog. Retrieved on September 1st, 2015 from http://yattutime.blogspot.com/2014/02/numa-non-uniform-memory-access-overview.html.

    FUNDAMENTAL

    Müller, Daniel. (9th December, 2013). Memory and Thread Management on NUMA Systems, Diploma Thesis, Technische Universität Dresden. Retrieved on September 1st, 2015 from http://os.inf.tu-dresden.de/papers_ps/danielmueller-diplom.pdf.

    Comment: Comprehensive and more technical.

    Denneman, Frank. (February 27th, 2015). Memory Deep Dive: NUMA and Data Locality, Blog. Retrieved on September 1st, 2015 from http://frankdenneman.nl/2015/02/27/memory-deep-dive-numa-data-locality.

    Comment: Part of a larger series on memory systems.

    HISTORICAL

    Bolosky, William J., Robert P. Fitzgerald, Michael L. Scott. (1989). Simple But Effective Techniques for NUMA Memory Management, ACM SIGOPS Oper. Syst. Rev., Vol. 23, No. 5, pp. 19-31. Retrieved on September 1st, 2015, from http://www.cs.berkeley.edu/~prabal/resources/osprelim/BFS89.pdf.

    Comment: Seminal paper.

    OPERATING SYSTEMS

    Linux Operating System. (August 8, 2012). NUMA(7) Manpage. Retrieved on September 1st, 2015 from http://man7.org/linux/man-pages/man7/numa.7.html.

    Drepper, Ulrich. (October 17th, 2007). Memory part 4: NUMA support, LWN.net. Retrieved on September 1st, 2015 from http://lwn.net/Articles/254445.

    Comment: LWN is Linux focused.

    Sourceforge. (November 20th, 2002). Linux Support for NUMA Hardware. Retrieved on September 1st, 2015 from http://lse.sourceforge.net/numa.

    Microsoft Corporation. NUMA Support (Windows), Windows Dev Center. Retrieved on September 1st, 2015 from https://msdn.microsoft.com/en-us/library/windows/desktop/aa363804(v=vs.85).aspx.

    Comment: Programming focused with API support.

    TOOLS

    McCurdy, Collin, and Jeffrey Vetter. (March 2010). Memphis: Finding and Fixing NUMA-related Performance Problems on Multi-core Platforms, ISPASS-2010: 2010 IEEE International Symposium on Performance Analysis of Systems and Software, March 28-30, 2010, White Plains, NY.

    Levinthal, David. Performance Analysis Guide for Intel® Core™ i7 Processor and Intel® Xeon™ 5500 processors , v1.0, Intel Developer Zone. Retrieved on September 1st, 2015 from https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf.

    Comment: Use of VTune to look at NUMA.

    Lachaize, Renaud, Baptiste Lepers, and Vivien Quéma. (June 2012), MemProf: a Memory Profiler for NUMA Multicore Systems, 2012 USENIX Annual Technical Conference, June 13-15, 2012, Boston, MA.

    Zickus, Don. (May 31st, 2013). Dive deeper in NUMA systems, Red Hat Developer Blog. Retrieved on September 1st, 2015 from http://developerblog.redhat.com/2013/05/31/dive-deeper-in-numa-systems.

    Intel Corporation (March 1st, 2010). Detecting Memory Bandwidth Saturation in Threaded Applications, Intel Developer Zone. Retrieved on September 1st, 2015 from https://software.intel.com/en-us/articles/detecting-memory-bandwidth-saturation-in-threaded-applications/.

    CHARACTERIZATION AND OPTIMIZATION

    Ott, David. (November 2nd, 2011). Optimizing Applications for NUMA, Intel Developer Zone. Retrieved September 1st, 2015 from https://software.intel.com/en-us/articles/optimizing-applications-for-numa.

    Comment: There is a considerably older version of this article (2004) that is still accessible.

    Hently, David. (June 2012). Multicore Memory Caching Issues – NUMA. Series from Channel Cscsch, Centro Svizzero di Calcolo Scientifico. Presented at the PRACE Summer School 21-23 June 2012 - Summer School on Code Optimisation for Multi-Core and Intel MIC Architectures at the Swiss National Supercomputing Centre in Lugano, Switzerland. Video retrieved on September 1st, 2015 from https://www.youtube.com/watch?v=_cmViSD6Quw&index=17&list=PLAUXS_xuCc_rjvp-lJliGFtBPWpKNAY-y.

    Mario, Joe and Don Zickus. (August 2013). NUMA - Verifying it's not hurting your application performance, Redhat Developer Exchange, August 27, 2013, Boston, MA, USA. Retrieved September 1st, 2015 from http://developerblog.redhat.com/2013/08/27/numa-hurt-app-perf/.

    CASE STUDIES

    Leis, Viktor, Peter Boncz, Alfons Kemper and Thomas Neumann. (June 2014). Morsel-Driven Parallelism: A NUMA-Aware Query, Evaluation Framework for the Many-Core Age, SIGMOD’14, June 22–27, 2014, Snowbird, UT, USA. Retrieved September 1st, 2015 from http://www-db.in.tum.de/~leis/papers/morsels.pdf.

    Li, Yinan, Ippokratis Pandis, Rene Mueller, Vijayshankar Raman and Guy Lohman. (January 2013). NUMA-aware algorithms: the case of data shuffling, 6th Biennial Conference on Innovative Data Systems Research (CIDR’13), January 6-9, 2013, Asilomar, California, USA. Retrieved September 1st, 2015 from http://www.cidrdb.org/cidr2013/Papers/CIDR13_Paper121.pdf.

     

     

     

    Author


    Taylor Kidd is an engineer and frequent contributor to the Intel Developer Zone. He currently works on the Intel® Xeon Phi™ Scale Engineering Team, producing developer-facing content and answering a variety of developer questions. Taylor has worked in a variety of fields in the past, including HPC, embedded systems, research, and teaching.

     

     

  • server
  • Parallel Programming
  • Taylor Kidd
  • Intel Xeon Phi Coprocessor
  • MIC
  • Knights Landing
  • manycore
  • Many Core
  • KNL
  • Developers
  • Professors
  • Students
  • Server
  • Cluster Computing
  • Intel® Core™ Processors
  • Intel® Many Integrated Core Architecture
  • Optimization
  • Parallel Computing
  • Platform Analysis
  • Threading
  • Vectorization
  • URL

    Undefined MKL symbol when calling from within offloaded region

    Hello,

    In an offloaded region of a Fortran 90 application, I want to call MKL routines (dgetri/dgetrf) in a sequential way, that is, each thread on the MIC calls these routines with its own data. They are not multithreaded calls.

    I use the Intel® Math Kernel Library Link Line Advisor (v4.4).

    For Linux / Compiler-assisted offload / Intel Fortran / Dynamic / LP64 / Sequential, I get back:

    link line :  -L${MKLROOT}/lib/intel64 -lmkl_intel_lp64 -lmkl_core -lmkl_sequential -lpthread -lm

    compiler options :  -I${MKLROOT}/include -offload-attribute-target=mic -offload-option,mic,compiler," -L${MKLROOT}/lib/mic -lmkl_intel_lp64 -lmkl_core -lmkl_sequential"

     

    So I use them; I successfully compile the application and run my code. It crashes on the first call to an MKL routine (here dgetri) with the following error message:

    On the sink, dlopen() returned NULL. The result of dlerror() is "/var/volatile/tmp/coi_procs/1/8056/load_lib/ifortoutZ78m5J: undefined symbol: dgetri_"
    On the remote process, dlopen() failed. The error message sent back from the sink is /var/volatile/tmp/coi_procs/1/8056/load_lib/ifortoutZ78m5J: undefined symbol: dgetri_
    offload error: cannot load library to the device 0 (error code 20)

    The environment variables for the compiler, for the MKL are set with the proper scripts, for example

    echo $MIC_LD_LIBRARY_PATH
    /opt/intel/mic/coi/device-linux-release/lib:/opt/intel/mic/myo/lib:
    /opt/intel/composer_xe_2013_sp1.2.144/compiler/lib/mic:/opt/intel/composer_xe_2013_sp1.2.144/mkl/lib/mic:
    /opt/intel/composer_xe_2013_sp1.2.144/compiler/lib/mic:/opt/intel/composer_xe_2013_sp1.2.144/mpirt/lib/mic:
    /opt/intel/mic/coi/device-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/mic/coi/device-linux-release/lib:
    /opt/intel/mic/myo/lib:/opt/intel/composer_xe_2013_sp1.2.144/compiler/lib/mic:
    /opt/intel/composer_xe_2013_sp1.2.144/mkl/lib/mic:/opt/intel/composer_xe_2013_sp1.2.144/tbb/lib/mic

    (Carriage returns are mine, for readability)

    I tried the environment variable OFFLOAD_REPORT, but I did not get anything more.

    So could someone tell me what is going wrong in the compile or link step?

     

    Moreover, is there a place where I can find the meaning of the error messages received at runtime on the MIC? For example, in other situations, I got the following message

    offload error: cannot start process on the device 0 (error code 22)

    but micctrl tells me everything's fine.

    Thank you in advance for your advice.

    Regards

       Guy.

     

     

    MKL: Cholesky decomposition error with Xeon Phi

    Hi,

    I have a simple C++ code that calls the LAPACK dpotrf function to do a Cholesky decomposition, as well as dgetrf and dgetri. I see very weird behavior on a Xeon server with 6 Xeon Phi cards.

    1) Performance:

    For matrix size 12000x12000:

    a run with MKL_MIC_ENABLE=1 exported is slower than with MKL_MIC_ENABLE=0: 20 seconds vs. 11 seconds

    It seems that MKL does not do a good job here.

     

    2) Bug:

    Cholesky decomposition with a matrix size of 10000 x 10000 or 9500 x 9500 and MKL_MIC_ENABLE=1 causes "Segmentation fault (core dumped)" with this kernel log:

    traps: inverse[10032] general protection ip:7fc3b27955c7 sp:7fffe2da28c0 error:0 in libmkl_intel_thread.so[7fc3b206b000+ffc000]

    This problem happens on MKL 11.2.1 and  11.3.0 (newest Studio 2016).

    Any help please?

     

     

     

     

     

     

    Finite Differences on Heterogeneous Distributed Systems

    Download Zip Source Code

    Here we show how to extend Finite Difference (FD) computational kernels to run on distributed systems. Additionally, we describe a technique for dealing with the load imbalance of heterogeneous distributed systems, where different nodes or compute devices may provide distinct compute speeds. Sample source code is provided to illustrate our implementation.

    Our building block is the FD compute kernels that are typically used for RTM (reverse time migration) algorithms for seismic imaging. The computations performed by the ISO-3DFD (Isotropic 3-dimensional finite difference) stencils play a major role in accurate imaging of complex subsurface structures in oil and gas surveys and exploration. Here we leverage the ISO-3DFD discussed in [1] and [2] and illustrate a simple MPI-based distributed implementation that enables a distributed ISO-3DFD compute kernel to run on a hybrid hardware configuration consisting of host Intel® Xeon® processors and attached Intel® Xeon Phi™ coprocessors. We also explore Intel® software tools that help to analyze the load balance to improve performance and scalability.

    The Distributed ISO-3DFD

    Our illustrative case is a 1D decomposition of the ISO-3DFD compute stencil. We set up a computational domain that is split across MPI processes. For this example, we set up one MPI process per compute device (an Intel Xeon processor or an Intel Xeon Phi coprocessor). This implementation includes the halo exchanges required between MPI processes, that is, between processors and coprocessors. When domain decomposition is applied to an FD stencil as described in [1, 2], it is necessary to implement halo exchanges between subdomains at each time-step of the algorithm. This is because the value updates at domain points close to a border of the subdomain require values computed on an adjacent subdomain:

    for(int k=1; k<=HL; k++)    //Stencil Half-Length HL
              u_0 += W[k]*(
                U0(ix+k,iy  ,iz  ) + U0(ix-k,iy  ,iz  ) +
                U0(ix  ,iy+k,iz  ) + U0(ix  ,iy-k,iz  ) +
                U0(ix  ,iy  ,iz+k) + U0(ix  ,iy  ,iz-k));

    The order of the 3D stencil is defined by its half-length (HL) value: a stencil of 8th order has a half-length HL=8/2=4, for example. The “width” of the halos to be swapped between adjacent subdomains is also equal to HL.
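
    For context, the following is an illustrative sketch of how this inner accumulation is typically embedded in a full sweep over one subdomain; the U0 indexing macro, loop bounds, and the simplified time update are assumptions for illustration and are not copied from the ISO-3DFD sample in [1, 2].

    #include <stddef.h>   /* size_t */

    /* Illustrative stencil sweep over one nx x ny x nz subdomain padded with
     * HL ghost points on each side; not the actual ISO-3DFD kernel. */
    void stencil_sweep(float *u1, const float *u0, const float *W,
                       int nx, int ny, int nz, int HL)
    {
      #define U0(x, y, z) u0[((size_t)(z) * ny + (y)) * nx + (x)]
      for (int iz = HL; iz < nz - HL; iz++)
        for (int iy = HL; iy < ny - HL; iy++)
          for (int ix = HL; ix < nx - HL; ix++) {
            float u_0 = W[0] * U0(ix, iy, iz);
            for (int k = 1; k <= HL; k++)    //Stencil Half-Length HL
              u_0 += W[k] * (U0(ix + k, iy, iz) + U0(ix - k, iy, iz) +
                             U0(ix, iy + k, iz) + U0(ix, iy - k, iz) +
                             U0(ix, iy, iz + k) + U0(ix, iy, iz - k));
            /* the real RTM kernel also combines the previous time step and the
             * velocity model here; that update is omitted for brevity */
            u1[((size_t)iz * ny + iy) * nx + ix] = u_0;
          }
      #undef U0
    }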

    This sample code uses a symmetric execution model: the code runs on both host processors and coprocessors. This can be accomplished via fully symmetric MPI execution, with distinct processes running on Intel Xeon processors and on Intel Xeon Phi coprocessors. For example, assume a single two-socket system named hostname1 with two Intel Xeon Phi coprocessor cards (named hostname1-mic0 and hostname1-mic1) attached to the system’s x16 PCIe* slots. Also assume two executable binaries: rtm.cpu (compiled for the processor architecture, for example with Intel® Advanced Vector Extensions 2 (Intel® AVX2)) and rtm.phi (compiled for the architecture of the Intel Xeon Phi coprocessor). Using the Intel® MPI Library, one can leverage both executables in an MPI+OpenMP* symmetric mode execution:

    mpirun \
    -n 1 -host  hostname1  -env I_MPI_PIN_DOMAIN=socket  -env OMP_NUM_THREADS=14 ./rtm.cpu : \
    -n 1 -host  hostname1  -env I_MPI_PIN_DOMAIN=socket  -env OMP_NUM_THREADS=14 ./rtm.cpu : \
    -n 1 -host  hostname1-mic0 –env OMP_NUM_THREADS=244 ./rtm.phi  : \
    -n 1 -host  hostname1-mic1 –env OMP_NUM_THREADS=244 ./rtm.phi

    The simplified single node example above assumes that both rtm.cpu and rtm.phi are parallelized via OpenMP threading as described in [1, 2]. MPI is used for data exchanges and synchronization between nodes, processors, and coprocessors. OpenMP is used to divide MPI process compute work through the cores of a given processor or coprocessor. The above example can also be expanded for multiple nodes with processors and coprocessors. See [3] for more details on MPI symmetric mode execution.

    The simplified MPI harness presented here: 1) assumes that each Intel Xeon Phi coprocessor behaves like an independent compute node—no offloading programming syntax is used, and 2) allows asynchronous halo exchanges via non-blocking MPI calls. The overlapping of the compute and halo exchange with adjacent subdomains is accomplished by considering two types of compute regions on each subdomain:

    1. Local compute: Points with distance > HL from the adjacent borders. That is, points where the stencil computation relies only on values previously computed in the same subdomain.
    2. Halo compute: Points with distance <= HL from the adjacent borders. That is, points where the stencil computation also requires values computed on an adjacent subdomain, which are obtained through the halo exchange.

    Schematically we have:

    The MPI implementation is in the style of a straightforward nearest-neighbor halo swap. First, buffers BufferPrev and BufferNext are used to post asynchronous receives for the halos needed from the neighbor subdomains:

    Next, points in the Halo compute region are updated first, because these will be needed by the adjacent subdomains:

    Updated values in the halo compute region are sent asynchronously to the adjacent domains: 

    As the asynchronous halo exchange happens, the points in the local compute region can be updated because these computations do not depend on values from adjacent subdomains:

    An MPI_Waitall synchronization call is then used to check for completion of the asynchronous halo exchanges. Finally, the values received in transfer buffers BufferPrev and BufferNext are copied to the subdomain:
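
    Put together, the four steps just described follow the pattern sketched below; this is a condensed illustration assuming a 1D decomposition along z, float data, a halo of halo_size points on each side, and hypothetical helper functions (compute_halo_region, compute_local_region, first_halo, last_halo, copy_halos_in) that stand in for code from the sample package rather than reproduce it.

    #include <mpi.h>

    /* Hypothetical helpers standing in for the stencil updates and halo copies
     * described in the text; they are not part of the released sample code. */
    void compute_halo_region(float *sub);
    void compute_local_region(float *sub);
    float *first_halo(float *sub);   /* halo layer adjacent to the "prev" rank */
    float *last_halo(float *sub);    /* halo layer adjacent to the "next" rank */
    void copy_halos_in(float *sub, const float *BufferPrev, const float *BufferNext);

    void halo_swap_step(float *sub, float *BufferPrev, float *BufferNext,
                        int halo_size, int prev, int next, MPI_Comm comm)
    {
      MPI_Request req[4];

      /* 1) Post asynchronous receives for the halos coming from both neighbors. */
      MPI_Irecv(BufferPrev, halo_size, MPI_FLOAT, prev, 0, comm, &req[0]);
      MPI_Irecv(BufferNext, halo_size, MPI_FLOAT, next, 1, comm, &req[1]);

      /* 2) Update the halo-compute region first, then send it asynchronously. */
      compute_halo_region(sub);
      MPI_Isend(first_halo(sub), halo_size, MPI_FLOAT, prev, 1, comm, &req[2]);
      MPI_Isend(last_halo(sub),  halo_size, MPI_FLOAT, next, 0, comm, &req[3]);

      /* 3) Overlap: update the local-compute region while halos are in flight. */
      compute_local_region(sub);

      /* 4) Wait for all exchanges, then copy the received halos into place. */
      MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
      copy_halos_in(sub, BufferPrev, BufferNext);
    }

    Using MPI_PROC_NULL as the prev or next rank for the first and last subdomains lets the same code handle the domain boundaries without special cases.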

    The actual implementation can be found in the sample source code package attached to this article. It contains an MPI layer running on top of the ISO-3DFD code previously published in [1, 2]. Each MPI process can be tuned for performance either via hardware settings such as turbo mode, hyperthreading, and ECC mode (Intel Xeon Phi coprocessor) or tuned through software optimization and tuning such as cache blocking, thread affinitization, and data prefetching distances. Refer to the original articles for details on single process optimization and tuning.

    Workload Balancing

    Workload balancing is a critical part of heterogeneous systems. The distributed ISO-3DFD exemplified here has tuning parameters that permit statically balancing the amount of computation work assigned to processor sockets and Intel Xeon Phi coprocessors. It is accomplished by two command-line parameters accepted by the executables:
    factor_speed_phi: Integer value representing how fast the FD code can run on the Intel Xeon Phi coprocessor.
    factor_speed_xeon: Integer value representing how fast the FD code can run on the Intel Xeon processor.

    The work-balance coefficient is the ratio factor_speed_phi / factor_speed_xeon that can be used to define how much work is assigned to an Intel Xeon processor and how much to an Intel Xeon Phi coprocessor. The balance between an Intel Xeon processor and an Intel Xeon Phi coprocessor can be defined at MPI launch time, allowing flexibility to support distinct Intel Xeon processors and Intel Xeon Phi coprocessor models.
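
    As a concrete illustration of how such a ratio can translate into a 1D split of the grid, the sketch below divides the z-planes in proportion to the per-device speed factors; the variable names, device counts, and rounding policy are illustrative assumptions, not code taken from dist-iso-3dfd.cc.

    #include <stdio.h>

    int main(void)
    {
        int nz = 1024;                       /* total z-planes to distribute       */
        int n_phi = 4, n_xeon = 2;           /* coprocessors and processor sockets */
        int factor_speed_phi = 13, factor_speed_xeon = 5;

        int total_weight = n_phi * factor_speed_phi + n_xeon * factor_speed_xeon;
        int nz_per_phi  = nz * factor_speed_phi  / total_weight;
        int nz_per_xeon = nz * factor_speed_xeon / total_weight;

        /* planes lost to integer rounding; hand them to one of the devices */
        int remainder = nz - (n_phi * nz_per_phi + n_xeon * nz_per_xeon);

        printf("z-planes per coprocessor: %d, per socket: %d (+%d leftover)\n",
               nz_per_phi, nz_per_xeon, remainder);
        return 0;
    }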

    The static load balancing can be easily obtained with the help of an MPI message-profiling tool. Here we use Intel® Trace Analyzer and Collector (ITAC) [4] as the message-profiling tool. One way to accomplish this is by collecting an MPI trace and analyzing the load balance:

    1. Before running the experiment, source ITAC to the runtime environment:
      . /opt/intel/itac_latest/bin/itacvars.sh
    2. Run the MPI job with the option “-t” added to the mpirun command. This causes the Intel® MPI Library runtime to collect MPI message traces that are saved on files with extensions *.stf and *.prot.
    3. To visualize the results, use the ITAC GUI by launching the application traceanalyzer:
      Select open to access the *.stf file, and then select Continue to skip the Summary Page.
    4. In the next window, select  Charts -> Event Timeline, and then use the mouse to zoom in to the time line to be better able to visualize the communication pattern between processors and coprocessors.

    The following figures show an example of load balancing on a system with two processor sockets (MPI processes P2 and P3) and four Intel Xeon Phi coprocessors (MPI processes P0, P1, P4 and P5). Each MPI process is represented by a horizontal bar in the time-line graph. The time region marked in red represents the process potentially blocked by communication/synchronization; the time region marked in blue represents computing without communication. The goal of static load balancing is to minimize or eliminate the red regions, demonstrating good balance and processor unit utilization.

    For this synthetic example, the experiments are run on a dual-socket system with two Intel® Xeon® E5-2697 v2 processors and 64 GB of DDR3-1866 memory, populated with four Intel® Xeon Phi™ 7120 coprocessors (PCIe x16), each with 61 cores at 1.3 GHz and 16 GB of GDDR5. We consider three sets of values for the pair factor_speed_phi and factor_speed_xeon. For the first case we set factor_speed_phi=10 and factor_speed_xeon=5, representing the assumption of a ratio 10/5 = 2, where one Intel Xeon Phi coprocessor computes 2x faster than a single processor socket. Assume the MPI tracing of the experiment resulted in:

    The above event time line suggests that processors (MPI process ranks P2 and P3) took longer to complete their work and the coprocessors (P0, P1, P4, and P5) were blocked idle (red regions) waiting for the work to be completed on the processors. In the second case we set factor_speed_phi=15 and factor_speed_xeon=5, representing the assumption of ratio 15/5= 3 where one Intel Xeon Phi coprocessor computes 3x faster than one single processor socket. The respective MPI tracing shows:

    The above event time line suggests that the coprocessors (P0, P1, P4, and P5) took longer to complete their work, now 3x larger than the amount assigned to the processors (P2 and P3), causing the processors to sit idle (red regions) waiting for the work to be completed.

    Finally, we set factor_speed_phi=13 and factor_speed_xeon=5, representing the assumption of ratio 13/5= 2.6 where one Intel Xeon Phi coprocessor computes 2.6x faster than one single processor socket:

    For this case, there is basically no noticeable idle time due to load imbalance. This case represents an optimal load-balancing configuration.

    Note that the above static analysis applies to the specific hardware configuration (processor model, Intel Xeon Phi coprocessor model, number of processors, memory speed, and so on). Any change in hardware configuration or algorithm implementation would require a new static analysis.

    Sample Code

    To illustrate the above concepts, we provide sample source code in the attached package. The reader should be able to build, run, and analyze the example. The MPI job can be set up to run MPI ranks (processes) on both Intel Xeon processor sockets and Intel Xeon Phi coprocessors, only on processor sockets, or only on Intel Xeon Phi coprocessor cards.

    For the sake of simplicity, we do not provide or discuss the single-process version of ISO-3DFD already presented and released in [1, 2]. Instead, we provide an additional MPI harness that extends ISO-3DFD to a distributed ISO-3DFD that supports halo exchanges and load balancing between processors and coprocessors. In this way, future versions of ISO-3DFD can be dropped into this same sample code.

    The only dependencies needed to build this simple example are the original ISO-3DFD single-process compute kernel, which can be found in [1, 2], and the Intel® software development tools (Intel® C and C++ compilers, Intel MPI Library, and ITAC).

    The README file contains instructions on how to build and run the sample code. The Makefile creates two executables: dist-iso-3dfd, which runs on the processors, and dist-iso-3dfd.MIC, which runs on the Intel Xeon Phi coprocessor. Use the variables VERSION_MIC and VERSION_CPU to choose which version of the ISO-3DFD to use.

    The script run_example.sh suggests how to launch both executables using the Intel MPI Library. Refer to [3] for additional information. The script exemplifies a way to run on a system populated with two processors and two Intel Xeon Phi coprocessor cards. It can easily be expanded to run on a cluster and on different node configurations. When the script variable TRACE_OPTION=-t is set, traces of the MPI communication are collected so that one can perform the static load balancing as described in the previous section. The static load balancing is possible because the main source code dist-iso-3dfd.cc accepts the parameters factor_speed_phi and factor_speed_xeon as command-line options. Use these as described in the Workload Balancing section.

    The script also supports all the command-line options required by the main executable:
    nx  ny  nz  #threads  #iterations  x-blocking  y-blocking  z-blocking  factor_speed_phi  factor_speed_xeon
    and additional MPI and OpenMP settings.

    Note this is only an example. For optimal compiling options and runtime parameters for the ISO-3DFD compute kernel, refer to [1,2]. For performance results on single and multiple nodes refer to [5].

    Conclusion

    This article described a sample implementation of a distributed FD method: a 1D decomposition of an ISO-3DFD stencil compute kernel. The implementation supports heterogeneous systems with processors and coprocessors through static load balancing, which deals with the different compute speeds of each device.

    Using an MPI message-profiling tool, one can analyze whether, at each time step, 1) the Intel Xeon Phi coprocessors are waiting for the processors to complete, or 2) the Intel Xeon processors are waiting for the Intel Xeon Phi coprocessors to complete.

    References

    [1] “Optimizing and Running ISO 3DFD with Support for Intel® Xeon Phi™ Coprocessor.” http://software.intel.com/en-us/articles/eight-optimizations-for-3-dimensional-finite-difference-3dfd-code-with-an-isotropic-iso

    [2] “Characterization and Optimization Methodology Applied to Stencil Computations,” in the book High Performance Parallelism Pearls (J. Reinders and J. Jeffers, editors). http://www.techenablement.com/characterization-optimization-methodology-applied-stencil-computations/

    [3] “How to run Intel MPI on Xeon Phi” http://software.intel.com/en-us/articles/how-to-run-intel-mpi-on-xeon-phi

    [4] Tutorial: Analyzing MPI Applications with Intel® Trace Analyzer and Intel® VTune™ Amplifier XE. http://software.intel.com/en-us/analyzing-mpi-apps-with-itac-and-vtune

    [5] “Intel® Xeon Phi™ Coprocessor Energy Application Benchmarks.” http://www.intel.com/content/www/xr/en/benchmarks/server/xeon-phi/xeon-phi-energy.html

    Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer or learn more at intel.com.

    Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

    No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

    Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

    This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

    The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

    Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm.

    This sample source code is released under the Intel Sample Source Code License Agreement.

    Intel, the Intel logo, Intel Xeon Phi, and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.

    *Other names and brands may be claimed as the property of others

  • seismic
  • RTM
  • stencil
  • 3D finite difference
  • 3DFD
  • distributed
  • Cluster
  • Intel® Xeon® processors
  • Intel® Xeon Phi™ Coprocessors
  • Developers
  • Linux*
  • Server
  • Message Passing Interface
  • OpenMP*
  • Cluster Computing
  • Code Modernization
  • Intel® Many Integrated Core Architecture
  • Optimization
  • Parallel Computing
  • URL
  • Co-authors: 

    Cédric ANDREOLLI (Intel)
    PHILIPPE T.

    Where to find detailed code examples for offload in Fortran under both Linux and Windows?

    Hi,

    I must congratulate Intel and closely linked companies for the massive and detailed information on how to introduce parallelization in computation-heavy codes! I also happened to win a copy of the Jeffers and Reinders book, which I found excellent due to its step-by-step approach and detailed descriptions. I was convinced to invest in a new workstation with 4 Xeon Phi cards and also understood that MPI and the offload model were the right way for me.

    I agree with the comment by Bruce Weaver dated 04/05/2013 that detailed examples of various implementations with Fortran will save persons that are building models of e.g. physical phenomena and lack deep insights in the system anatomy of MIC a lot of time.

    I personally will now restructure my Fortran code using MPI and the offload model, primarily under Windows 7 Ultimate. I will, however, also do this under Linux, as I am not totally convinced that Windows will work. Windows is important because of the heavy use of graphics (QuickWin) to supervise the computations at various levels of the code. This use of graphics has turned out to be extremely effective in troubleshooting.

    I now wonder if there are any detailed descriptions of MPI on the host plus offload to the MIC for Fortran under Windows and/or Linux. The code examples would need to be a little more realistic than the Hello World examples.

    Best regards

    Anders S

    Finding elementwise and conditional matrix multiplication implementation with MKL

    Hi all,

    I have been looking for an MKL version of elementwise matrix multiplication that works based on a conditional approach. While Vmult can be used, it works only on a 1D vector rather than a matrix.

    Below is the code I would like to rewrite with an MKL version, if possible.

    logical(log_kind) :: check(2000,2000)
    real(8) :: a(2000,2000), b(2000,2000), c(2000,2000)   ! assumed: a, b, c are matrices of the same shape

    do i = 1, 2000
       do j = 1, 2000
          if ( check(i,j) ) then
             c(i,j) = a(i,j) * b(i,j)
          end if
       enddo
    enddo

    I know Vmult helps, but it has no conditional operation.

    Is there a conditional vector library or an elementwise matrix library?

    Can this be done by a combination of MKL library operations?
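
    One possible combination, sketched here in C rather than Fortran and offered only as an assumption to benchmark (it is not claimed to beat the plain loop): MKL's VML routine vdMul forms the full elementwise product over the flattened matrix, and the logical condition is then applied with an ordinary masked copy, since VML has no conditional variant.

    #include <mkl.h>

    /* c(i,j) = a(i,j) * b(i,j) wherever check(i,j) is true, with the matrices
     * flattened to contiguous length-n arrays (n = 2000*2000 in the question). */
    void masked_elementwise_mul(const double *a, const double *b,
                                const unsigned char *check, double *c, MKL_INT n)
    {
        double *tmp = (double *)mkl_malloc(n * sizeof(double), 64);

        vdMul(n, a, b, tmp);              /* full elementwise product via VML */

        for (MKL_INT i = 0; i < n; ++i)   /* apply the condition afterwards */
            if (check[i])
                c[i] = tmp[i];

        mkl_free(tmp);
    }

    From Fortran, the same masked copy could be expressed with a WHERE block or the MERGE intrinsic after the vdmul call.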

     

     

     

     


    New article “Finite Differences on Heterogeneous Distributed Systems”

    Intrinsic to down-convert all 8 elements of i64 vectors to lower/higher 8 elements of i32 vector

    Is there such a thing?

    I think the pack/unpack intrinsics are somewhere close, but I could not understand exactly what they do.

    It seems so basic that I almost feel stupid asking, but I would really appreciate a pointer.

    I would rather up-convert using a gather instruction, but AFAIK there is no up-conversion for gathering into an epi64 vector.

    Any suggestions?

    Is printenv removed in the latest MPSS?

    The printenv command exists in MPSS version 3-2.1.6720-16, but it seems to have been removed in 3.5. Does anyone know the reason?

    host-device bandwidth problem

    Dear forum,

    I'm testing the host-device bandwidth using the DAPL fabric and Intel MPI (Isend/Irecv/Wait). 1.5 GB of data is repeatedly sent back and forth. The initial result is:

    host to device: ~5.6 GB/sec
    device to host: ~5.8 GB/sec
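
    For reference, the measurement loop is roughly of the following shape; this is an illustrative reconstruction (not the poster's code), assuming exactly two ranks with rank 0 launched on the host and rank 1 on the coprocessor.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define SIZE (1536UL * 1024 * 1024)   /* 1.5 GB, as in the experiment above */
    #define REPS 10

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char *buf = malloc(SIZE);
        memset(buf, 1, SIZE);             /* touch the pages before timing */

        int peer = 1 - rank;              /* rank 0 = host, rank 1 = coprocessor */
        for (int r = 0; r < REPS; r++) {
            MPI_Request req;

            /* host -> device */
            double t0 = MPI_Wtime();
            if (rank == 0) MPI_Isend(buf, SIZE, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req);
            else           MPI_Irecv(buf, SIZE, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
            double t1 = MPI_Wtime();
            if (rank == 1) printf("host to device, iter %d: %.2f GB/s\n", r, SIZE / (t1 - t0) / 1e9);

            /* device -> host */
            t0 = MPI_Wtime();
            if (rank == 1) MPI_Isend(buf, SIZE, MPI_CHAR, peer, 1, MPI_COMM_WORLD, &req);
            else           MPI_Irecv(buf, SIZE, MPI_CHAR, peer, 1, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
            t1 = MPI_Wtime();
            if (rank == 0) printf("device to host, iter %d: %.2f GB/s\n", r, SIZE / (t1 - t0) / 1e9);
        }

        free(buf);
        MPI_Finalize();
        return 0;
    }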

    Problem 1: The first send-receive appears to be extremely slow. Its bandwidth is:

    host to device: ~2.6 GB/sec
    device to host: ~2.5 GB/sec

    I immediately thought of Linux's deferred memory allocation that Jim pointed out in this post, so I memset the arrays prior to the send/receive, but to little avail. So, is it because of the overhead of Intel MPI's first send/receive?

    Problem 2: When I increased the data size to 2 GB, the following message was displayed:

    [mic_name]:SCM:3be5:19664b40: 9659192 us(9659192 us!!!):  DAPL ERR reg_mr Cannot allocate memory

    The program completes without a problem, though. So what causes that error message?

     

    Thanks for any advice.

    GLIBC_PRIVATE not defined in file ld-linux-x86-64.so.2 with link time reference

    We are trying to set up a Gentoo-based system (4.0.5 kernel) as a development workstation capable of compiling offload applications for the Xeon Phi. Because Gentoo does not work with RPMs, we installed MPSS 3.5.2 by extracting the RPMs and copying the files into place. This procedure is described in Section 4 of this paper: http://colfaxresearch.com/installing-intel-mpss-3-3-in-arch-linux/. In Arch Linux with MPSS 3.3, the procedure resulted in a functional MPSS that was able to drive the Xeon Phi. Our current task is much simpler: we are not trying to run MPSS; we just want to use it so the Intel compiler can compile offload applications.

    The problem is below. The minimal reproducer code is

    #include <stdio.h>

    int main() {
    #pragma offload target(mic)
      {
        printf("Hi\n");
      }
    }

    The compilation error is:

    $ icpc minimal_reproducer.cc
    x86_64-k1om-linux-ld: relocation error: /lib/libc.so.6: symbol _dl_find_dso_for_object, version GLIBC_PRIVATE not defined in file ld-linux-x86-64.so.2 with link time reference

    This is probably just an issue with missing files or environment variables, but I am at a loss as to where to look. The compiler version is 2016.0.109. It is `stable' Gentoo with glibc 2.20.

    Any pointers would be much appreciated!
