II.C – Vectorization – The next step is to understand whether the compiler properly vectorizes the code. The word “vector” shouldn’t be taken too literally here, as it usually harks back to vector systems (from the Cray-1 onward). In the current era, “vector” refers to the short, fixed-length registers in the CPU and their associated operations (like the SSE and AVX instruction sets in the x86-64 world, or QPX on the IBM BlueGene/Q). Detailed works on vectorizers already exist, such as Intel’s own. These are of course interesting for advanced users, but might be too detailed for the newcomer to vectorization.
The principle of vector instruction sets is to allow a single instruction to perform multiple operations at once, thus adding potential performance at a low cost. Analyzing why this is a good trade-off would require a long explanation; suffice it to say that nearly all contemporary CPUs use such instruction sets. We will detail some interesting differences between them in section D below.
A good starting point to understand vectorization better is SSE2’s mulsd mentioned in section A. In the Intel documentation, this instruction’s full name is “Multiply Scalar Double-Precision Floating-Point Values”. As this description implies, mulsd performs a multiplication on double-precision floating-point values. However, it has a counterpart called mulpd, or “Multiply Packed Double-Precision Floating-Point Values”. Instead of a single multiplication, mulpd executes one per element in the so-called vector. The precise size of the vector depends on the instruction set (SSE or AVX). If we take the example of SSE, where vector registers are 128 bits, then we have two double-precision values (64 bits each, as per IEEE 754-2008) in each register. In other words, each instruction will compute two flops instead of one. If we go back to the saxpy function from listing 1, we can update the assembly version to exploit this new instruction, as shown in listing 3.
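As a reminder of the kernel in question, a scalar saxpy over doubles has the following shape (a minimal sketch; the actual listing 1 may differ in details such as argument names):

```c
#include <stddef.h>

/* Sketch of the saxpy kernel discussed above: y[i] = a * x[i] + y[i].
 * The multiplication inside the loop is what the compiler lowers to
 * mulsd (one double at a time) or, once vectorized under SSE2, mulpd
 * (two doubles per instruction). */
static void saxpy(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```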
Obviously, this code will not work if the number of elements in the vector is not a multiple of two, as we compute two values for each iteration (note that it was hand-modified from the automatically generated code in listing 2, as the compiler-vectorized code, including all the necessary checks, is not suitable for an introductory example). But it does illustrate the point that with the same number of instructions, we perform twice as much work. On such a simple loop the compiler has no issue with vectorization. It will even mention it in its output when using the -vec-report3 option, as can be seen in listing 4.
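In C, the two-at-a-time pairing and the leftover check that the generated code must contain can be sketched as follows (a hand-written illustration of the idea, not the compiler’s actual output):

```c
#include <stddef.h>

/* Two-at-a-time processing with a remainder check, mimicking what a
 * vectorized saxpy must do when n is odd. The main loop performs two
 * multiply-adds per iteration (one mulpd's worth of work under SSE2);
 * the tail handles the last element if there is one. */
static void saxpy_pairs(size_t n, double a, const double *x, double *y)
{
    size_t i;
    for (i = 0; i + 1 < n; i += 2) {        /* vector body: 2 elements */
        y[i]     = a * x[i]     + y[i];
        y[i + 1] = a * x[i + 1] + y[i + 1];
    }
    if (i < n)                              /* scalar remainder */
        y[i] = a * x[i] + y[i];
}
```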
However, most loops are not that simple, and this is where understanding the compiler is important. A modern compiler will inform you about its successes and failures, but what these actually mean is not always obvious. Let’s take a very concrete example with the Hydro code. Hydro is a mini-application built from RAMSES, a code used to study large-scale structures and galaxy formation. It includes several variants, each of them designed to take advantage of a certain kind of hardware: a standard C version using OpenMP directives and MPI calls; a version using OpenACC directives to exploit accelerators; an OpenCL version for the same purpose; etc. What we are going to look at is the riemann function, which exists in two radically different versions: the original in the OpenACC version of the code, and an updated, vectorization-friendly variant in the C version. Compiling the original, untweaked version results in what is shown in listing 5. The loop nest of interest is the one referenced at lines 102, 107 and 138 (see its structure in listing 6). The compiler output shows that it failed to vectorize the innermost loop at line 138, because it couldn’t be certain that all iterations were data-independent from each other. The compiler is absolutely right: this innermost loop is a convergence loop, and each iteration is predicated on the results of all the preceding iterations. Therefore, it cannot be easily vectorized. As for the other two loops, the compiler doesn’t even try to vectorize them, since it could not process the innermost loop, a requirement for the current version of the vectorizer.
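The shape of the problem can be sketched as follows (this is NOT the actual Hydro source; the function and variable names are placeholders chosen for illustration):

```c
/* Schematic shape of the problematic loop nest: the inner `iter` loop
 * is a convergence loop, where each iteration refines p[j] using the
 * value produced by the previous iteration. This loop-carried
 * dependence is why the compiler rightly refuses to vectorize the
 * innermost loop. */
void riemann_shape(int narray, int niter, double *p, const double *rhs)
{
    for (int j = 0; j < narray; j++) {            /* parallel across j */
        for (int iter = 0; iter < niter; iter++) {
            double delta = (rhs[j] - p[j]) * 0.5; /* reads previous p[j] */
            p[j] += delta;                        /* carried dependence */
        }
    }
}
```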
There is an easy way to force the compiler to vectorize the code as it is: if we replace the upper bound of the innermost loop (Hniter_riemann) by a hardwired value like 10, then the compiler can fully unroll the innermost loop, making the narray loop the new innermost loop. In practice, the compiler still fails, as it still can’t figure out that the two external loops are fully parallel, but a directive exists to help it out: #pragma ivdep. Using it on both these loops allows vectorization of the code (with slightly modified line numbers), as shown in listing 7. However, this does not really help: the code is now wrong, not to mention that hardwiring a value is not an acceptable practice. To be able to properly exploit the hardware in this critical function, the code needs to be modified so that the compiler can vectorize it with the original semantics. For this particular code, the author decided to switch the convergence loop (the problematic one at line 138) and the narray loop at line 107. Of course, this implies a complete reorganization of the loops (and the use of additional temporary arrays). The new structure of the code is shown in listing 8, while the result of compiling it is given in listing 9. Note that the presence of an extra “SIMD” in the message is an artefact of the directives required to tell the compiler that the loop iterations are data-independent.
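The interchange can be sketched as follows (again placeholder names, not the actual rewritten Hydro code): the convergence loop becomes the outer loop, the now-innermost loop over narray is fully parallel, and a temporary array replaces the per-element early exit of the original convergence loop.

```c
/* Sketch of the loop interchange described above. The j loop is now
 * innermost and data-parallel; #pragma ivdep (honored by the Intel
 * compiler, ignored by others) asserts that its iterations are
 * independent so the vectorizer can proceed. The `done` temporary
 * records, per element, whether convergence has been reached. */
void riemann_interchanged(int narray, int niter, double *p,
                          const double *rhs, int *done)
{
    for (int j = 0; j < narray; j++)
        done[j] = 0;
    for (int iter = 0; iter < niter; iter++) {
#pragma ivdep
        for (int j = 0; j < narray; j++) {        /* vectorizable */
            if (!done[j]) {
                double delta = (rhs[j] - p[j]) * 0.5;
                p[j] += delta;
                if (delta < 1e-9 && delta > -1e-9) /* converged? */
                    done[j] = 1;
            }
        }
    }
}
```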
There are many reasons for the vectorizer to fail, and this paper is not the place to explain them all. While some compiler output messages are self-explanatory, others unfortunately aren’t. That being said, a few basic rules can help with identifying and fixing issues.
• The Intel compiler only tries to vectorize the innermost loop. Small innermost loops can and should be unrolled (automatically, by a pragma or manually) if the number of iterations is known. If it is not known, the situation is more difficult. It might be possible to invert the loop with a larger, parallel outer loop and the addition of temporary arrays.
• Except for some types of reduction (arithmetic accumulation in a scalar variable), the Intel compiler only vectorizes parallel loops. If the compiler complains that there is a vector dependence, check that there really isn’t one, and use the appropriate directives to help with the compilation.
• Conditionals in the loop nest might prevent vectorization, and make it less efficient anyway. It’s often a good idea to move iteration-independent conditions out of the loop nest.
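The last rule can be illustrated as follows (names are illustrative only): an iteration-independent condition is hoisted out of the loop, leaving two branch-free loops that are both trivial to vectorize.

```c
/* The flag use_scale does not depend on the loop iteration, so
 * testing it inside the loop would only hinder vectorization.
 * Hoisting it out yields two branch-free, vectorizable loops. */
void scale_or_offset(int n, int use_scale, double c, double *v)
{
    if (use_scale) {
        for (int i = 0; i < n; i++)   /* branch-free, vectorizable */
            v[i] *= c;
    } else {
        for (int i = 0; i < n; i++)
            v[i] += c;
    }
}
```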
While the compiler will tell you which loops have been vectorized, this is not the only condition for successfully using vector instruction sets. Compilers will often generate multiple variants of a loop depending on various parameters (number of iterations, alignment of data arrays…), and there is no guarantee the vector version will always be picked at runtime. Hardware counters inside the CPU can be used to validate the results during execution. Tools such as VTune or Likwid can be very useful to verify that the vectorized code has actually been executed.
II.D – The target machine – As mentioned in previous sections, there are two main vector instruction sets for the x86-64 platform: SSE and AVX. SSE was originally meant for single-precision floating-point and integer data. Double-precision floating-point was added in SSE2, and subsequent extensions added new instructions for various purposes. They all have in common a 128-bit register size, which is enough to hold four single-precision values, or two double-precision values, in the XMM registers.
The AVX instruction set was introduced using a completely new instruction-encoding scheme (called VEX). While this does not really concern most programmers, AVX’s main selling point was the doubling of the register width to 256 bits, for eight single-precision or four double-precision values. In the original AVX, only floating-point values can be worked on in the 256-bit YMM registers. The lower half of these YMM registers is aliased to the XMM registers. As you can guess, mixing SSE and AVX instructions in the same code incurs a small performance penalty.
One of the most overlooked features of the new AVX instruction set is the VEX.128 subset of instructions. These instructions only work on the lower 128-bit part of the YMM registers – the part aliased to the XMM registers. While they may seem quite redundant with the SSE instructions, they actually offer a very important upgrade: whereas SSE instructions mostly have two operands, AVX instructions have three. Listing 10 is an excerpt from the Intel documentation in which the SSE version stores its result in source operand xmm1, while the VEX.128 version destroys neither of its source operands. Whenever both sources are reused, this saves a register-register movapd instruction. This may not look like a lot, in particular in numerical codes where one of the operands will be one-use only most of the time. But the VEX.128 instructions are not limited to the floating-point instructions; many integer instructions have been re-implemented as well. They cannot access the upper 128 bits of the 256-bit YMM registers, but they do have access to the new three-operand format.
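The source-reuse pattern is easy to reproduce from C intrinsics (a minimal sketch with illustrative names): in the function below, both inputs stay live after the multiplication. Compiled for SSE, the destructive two-operand mulpd forces the compiler to insert a register-register movapd to preserve one source; compiled with -mavx, a single non-destructive three-operand vmulpd suffices, with the C source unchanged.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Both a and b are used again after the multiplication, so the SSE
 * code generator must copy one of them before the destructive mulpd;
 * under VEX.128, vmulpd writes to a third register instead. */
static __m128d mul_and_sum(__m128d a, __m128d b)
{
    __m128d prod = _mm_mul_pd(a, b);            /* a and b stay live */
    return _mm_add_pd(prod, _mm_add_pd(a, b));  /* a*b + (a + b)     */
}
```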
A code that benefits greatly from this is the Keccak algorithm by Bertoni et al., the winner of the NIST competition to create SHA-3. The “SIMD instructions and tree hashing” implementation uses integer SSE instructions to implement two 64-bit instances of the algorithm simultaneously. The implementation is done using C intrinsics rather than assembly. The instructions in use are mostly logical xors, logical ors and shifts (used to implement the rotation of a 64-bit value). A brief extract implementing a rotation is shown in listing 11 for SSE and listing 12 for AVX. Extra copies to preserve the input values are required by the algorithm, which reuses the input data several times. Statistics on the partially unrolled loop show that the number of data-movement instructions has gone from 396 (35% of the 1132 instructions) to 111 (13% of the 837 instructions). All the remaining moves are memory accesses, none of them being register-register. Speed measurements show an improvement of more than 20% for short hashes. As the only cost for high-level language or intrinsics-based code is to use the -mavx option (or similar), switching from SSE to AVX can offer a nice improvement for a very small cost.
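The rotation pattern in question can be sketched with SSE2 intrinsics as follows (an illustrative sketch, not the actual Keccak source): each 64-bit lane of the register is rotated left by n bits using two shifts and an or.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Rotate each 64-bit lane of x left by n bits (0 < n < 64). Under
 * SSE, preserving x across the destructive shift costs an extra
 * register-register copy; under VEX.128, the three-operand shift
 * forms leave x intact for free. */
static __m128i rotl64_pair(__m128i x, int n)
{
    return _mm_or_si128(_mm_slli_epi64(x, n),
                        _mm_srli_epi64(x, 64 - n));
}
```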
Next month, we’ll take a detailed look at how data structures can impact both code performance and algorithm complexity – with a special focus on the optimization of parallel codes. In the meantime, happy (performance) programming!
© HPC Today 2019 - All rights reserved.