Explicit Vector Programming with OpenMP 4.0 SIMD Extensions
November 19, 2014

3.3 – Understanding vector length selection

If programmers do not specify the vector length (VL) directly with the simdlen(VL) clause on a declare simd directive, the compiler determines VL from the physical vector width of the target instruction set and the characteristic type of the function. This default is intended to produce the most efficient SIMD code for the target. For example, on Intel architectures, VL selection uses the following rules:

  • If the target processor has an XMM vector architecture (i.e., no YMM vector support) and the characteristic type of the function is int, VL is 4.
  • If the target processor has Intel AVX (Advanced Vector Extensions) YMM support,
    – VL is 4 if the characteristic type of the function is int (integer vector operations in Intel AVX are performed on XMM);
    – VL is 8 if the characteristic type of the function is float;
    – VL is 4 if the characteristic type of the function is double.
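
The rules above amount to dividing the register width by the size of the characteristic type. The following is a minimal sketch of that arithmetic, not compiler code: the helper default_vl and the vl_* wrappers are hypothetical names introduced for illustration, and the sketch assumes a 4-byte int, a 4-byte float, and an 8-byte double.

```c
#include <assert.h>
#include <stddef.h>

/* Register widths in bytes for the two cases in the list above. */
enum { XMM_BYTES = 16, YMM_BYTES = 32 };

/* Hypothetical helper (not a compiler or OpenMP API): number of lanes that
   fit in one vector register of reg_bytes for elements of elem_bytes. */
int default_vl(size_t reg_bytes, size_t elem_bytes) {
    assert(elem_bytes > 0 && reg_bytes % elem_bytes == 0);
    return (int)(reg_bytes / elem_bytes);
}

/* XMM-only target: an int characteristic type uses one XMM register. */
int vl_sse_int(void)    { return default_vl(XMM_BYTES, sizeof(int)); }

/* Intel AVX target: integer operations stay on XMM, float and double use YMM. */
int vl_avx_int(void)    { return default_vl(XMM_BYTES, sizeof(int)); }
int vl_avx_float(void)  { return default_vl(YMM_BYTES, sizeof(float)); }
int vl_avx_double(void) { return default_vl(YMM_BYTES, sizeof(double)); }
```

Under the stated size assumptions, these wrappers reproduce the values in the list: 4 for int on either target, 8 for float, and 4 for double on AVX.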

For applications that do not require many vector registers, higher performance may be seen if the program is compiled to a doubled vector width: 8-wide for SSE using two XMM registers, and 16-wide for AVX using two YMM registers. This method can lead to significantly more efficient execution because of greater instruction-level parallelism and the amortization of various overheads over more program instances. For other workloads, it may lead to a slowdown because of higher register pressure. Trying both approaches on key kernels, using the simdlen clause, may be worthwhile.
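
As an illustration of requesting a doubled width, the sketch below asks for an 8-lane variant of a float function, which on an SSE target would correspond to two XMM registers per argument. The function scale_add and the driver apply_scale_add are hypothetical names, not taken from the article. Without OpenMP SIMD support enabled, the pragmas are ignored and the code runs as ordinary scalar C.

```c
#include <assert.h>

/* Ask the compiler for a vector variant that processes 8 floats per call.
   notinbranch states that the function is never called under a condition
   inside the SIMD loop. */
#pragma omp declare simd simdlen(8) notinbranch
float scale_add(float a, float b, float s) {
    return a * s + b;
}

/* Hypothetical driver: the simd loop lets the compiler call the 8-lane
   variant of scale_add instead of the scalar version. */
void apply_scale_add(float *restrict out, const float *a, const float *b,
                     float s, int n) {
    assert(n >= 0);
    #pragma omp simd
    for (int k = 0; k < n; k++) {
        out[k] = scale_add(a[k], b[k], s);
    }
}
```

To compare the two approaches on a given kernel, one would build once with simdlen(4) and once with simdlen(8) and time both; which is faster depends on the register pressure of the surrounding code.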

3.4 – Memory access alignment

Alignment optimization is important for current SIMD hardware, and its importance will grow as vector registers become wider. OpenMP 4.0 provides the aligned(list[:n]) clause so that programmers can express alignment information for the listed pointers and arrays. For compilers to generate optimal SIMD code on current Intel systems, n may be 8, 16, 32, or 64, specifying 8-byte, 16-byte, 32-byte, or 64-byte alignment.

For instance, a good array element access alignment is 16-byte alignment for Intel Pentium 4 to Core i7 processors, 32-byte alignment for Intel AVX processors, and 64-byte alignment for Intel Xeon Phi coprocessors. See the example below:

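  void arrayref(float *restrict x, float *y, int n, int n1) {
    /* __assume is an Intel compiler intrinsic: n1 is a multiple of 8 floats (32 bytes). */
    __assume(n1 % 8 == 0);
    /* Both x and y are declared 32-byte aligned, as described in the text. */
    #pragma omp simd aligned(x, y:32)
    for (int k = 0; k < n; k++) {
      x[k] = x[k] + y[k] + y[k+n1] + y[k-n1];
    }
  }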

Listing 3.4.1 – A loop example using the aligned clause.

In this example, both the array x and the pointer y are marked as 32-byte aligned. The memory reference offset n1 is asserted to satisfy n1 mod 8 = 0, so an offset of n1 floats is a multiple of 32 bytes. Together, these annotations tell the compiler that all vector memory references are 32-byte aligned.
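
The aligned clause is only a promise to the compiler, so the data must actually be allocated with that alignment. The following is a minimal sketch of one way to do this in portable C11, assuming aligned_alloc is available in the C library; alloc_floats_32 and add_aligned are hypothetical names introduced here, and the runtime assert is only a debugging aid.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocate count floats on a 32-byte boundary.
   C11 aligned_alloc expects the size to be a multiple of the alignment,
   so the byte count is rounded up to the next multiple of 32. */
float *alloc_floats_32(size_t count) {
    size_t bytes = count * sizeof(float);
    size_t rounded = (bytes + 31) / 32 * 32;
    return aligned_alloc(32, rounded);
}

/* Adds y into x. The aligned clause promises 32-byte alignment for both
   pointers; the assert checks that promise in debug builds. */
void add_aligned(float *restrict x, const float *restrict y, int n) {
    assert((uintptr_t)x % 32 == 0 && (uintptr_t)y % 32 == 0);
    #pragma omp simd aligned(x, y : 32)
    for (int k = 0; k < n; k++) {
        x[k] += y[k];
    }
}
```

Buffers obtained from alloc_floats_32 are released with free. Passing a pointer that does not meet the declared alignment would make the promise false, and the behavior of the vectorized loop would then be undefined.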

3.5 – Struct and multi-dimensional array accesses

SIMD architectures and modern compilers effectively support the vectorization of scalar and unit-stride memory accesses [1, 2, 5]. However, vectorizing structure and multi-dimensional array accesses normally requires the non-unit-stride load/store and gather/scatter instructions that SIMD hardware provides for strided and irregular memory accesses, and these are generally less efficient than unit-stride accesses.

A practical way for programmers to achieve effective vector parallelism through explicit vector programming is to convert Array of Structs (AOS) to Struct of Arrays (SOA), change the access order of array dimensions, and then apply SIMD vectorization. For C/C++ array accesses, SIMD should apply to the inner-most dimension (the last subscript, because C arrays are stored in row-major order); for Fortran array accesses, SIMD should apply to the first subscript, because Fortran arrays are stored in column-major order. The guideline is to reshape arrays or reorder array accesses so that the vectorized loop performs unit-stride memory accesses.
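
The AOS-to-SOA conversion can be sketched as follows. The type and function names (PointAOS, PointsSOA, aos_to_soa, scale_x_soa) and the fixed capacity NPTS are hypothetical and chosen only for illustration. In the AOS layout, consecutive x values are 12 bytes apart; in the SOA layout they are contiguous, so the SIMD loop reads and writes with unit stride.

```c
#include <assert.h>

enum { NPTS = 8 };  /* illustrative fixed capacity */

/* Array of Structs: the fields of one point are adjacent in memory. */
typedef struct { float x, y, z; } PointAOS;

/* Struct of Arrays: all x values are adjacent, then all y, then all z. */
typedef struct { float x[NPTS], y[NPTS], z[NPTS]; } PointsSOA;

/* One-time layout conversion; the reads from in[] are strided. */
void aos_to_soa(const PointAOS *in, PointsSOA *out, int n) {
    assert(n >= 0 && n <= NPTS);
    for (int k = 0; k < n; k++) {
        out->x[k] = in[k].x;
        out->y[k] = in[k].y;
        out->z[k] = in[k].z;
    }
}

/* After conversion, p->x[k] is a unit-stride access, so this loop can use
   plain vector loads and stores instead of gather/scatter. */
void scale_x_soa(PointsSOA *p, float s, int n) {
    assert(n >= 0 && n <= NPTS);
    #pragma omp simd
    for (int k = 0; k < n; k++) {
        p->x[k] *= s;
    }
}
```

The conversion itself has a cost, so this approach tends to pay off when the converted data is reused across many vectorized loops.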


© HPC Today 2024 - All rights reserved.
