An Introduction to Performance Programming (part I)
February 10, 2014


Most developers see the compiler as a necessary evil, except for those who have forsaken it completely (or so they think) for so-called interpreted languages. Obviously, these languages cannot run directly on the hardware, so a translation layer is still present. While it could be a virtual machine, for performance reasons it is often a just-in-time compiler, which, as its name implies, is still a compiler, but one that tries to be fast rather than efficient and over which the developer has very little control. For the vast majority of codes requiring all-out performance, the compiler will be offline and highly tweakable. The examples in the paragraphs below relate to Intel’s C/C++ and Fortran compilers, as they are probably the most ubiquitous in high performance computing, but most comments apply to other languages and other compilers as well.

II.A – The hardware problem – To understand why compiling the programming language into machine code via an offline compiler is so important, we first have to mention a few things about hardware considerations (a good starting point on the subject is Hennessy-Patterson [5]). A CPU does not execute sophisticated, high-level code. It only executes very basic instructions, of which three classes are really important to us: memory operations, computations, and control flow operations. Any computational loop (or even non-loop kernel) will essentially be a set of control flow instructions surrounding a sequence of memory loads, operations, and memory stores. Anything else is overhead, useless to the higher-level algorithm but eating up CPU time. The list includes, in particular, function calls, register moves, register spilling, virtual function handling, indirect references, etc.

The entire point of a good optimizing compiler is both to eliminate as many of these overheads as possible and to produce binary code that is as efficient as possible on the target architecture for the computational parts of the code. Any spurious instruction in the code flow will degrade performance, all the more so when the code is highly efficient. Let’s take a synthetic example with a fairly naive x86-64 version of the BLAS saxpy function, first in a basic C implementation (listing 1) then in its assembly version (listing 2).
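The original listings are not reproduced here; a minimal, unit-stride C version in the spirit of listing 1 (the signature is a simplified illustration of the BLAS convention, not the exact BLAS interface) might look like this:

```c
#include <stddef.h>

/* Naive saxpy: y[i] = a * x[i] + y[i]. Each iteration is two
   loads, one multiply, one add and one store, plus the loop
   control (increment and compare) discussed in the text. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```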

One might consider that the two lines with incq (line 11) and cmpq (line 12), dealing with control flow, are overheads that should be dealt with. Actually, they’re inexpensive: in the context of an otherwise idle integer unit, any modern out-of-order CPU core will execute them at the same time as the useful floating-point instructions. The real killer here is the sequence starting with the load instructions (line 7), as they will take hundreds of CPU cycles if they miss the various caches, followed by the producer-consumer dependency between the multiplication (line 8) and the addition (line 9), and between the addition and the store (line 10). Obviously, the literature is full of solutions to these problems. Loop unrolling will mask the remaining overheads from the control flow instructions. It will also, along with software pipelining, try to mask the producer-consumer dependencies. These are textbook considerations, and most if not all of them can be taken care of by the compiler (or indeed the BLAS library itself).
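As a hand-written sketch of what loop unrolling does (an illustration of the technique, not the compiler's actual output), a 4-way unrolled saxpy amortizes one increment/compare pair over four useful iterations and exposes four independent multiply-add chains that an out-of-order core can overlap:

```c
#include <stddef.h>

/* saxpy unrolled by 4: one incq/cmpq pair now serves four
   independent multiply-add chains, so the core can overlap
   their latencies instead of waiting on a single chain. */
void saxpy_unroll4(size_t n, float a, const float *x, float *y)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        y[i]     = a * x[i]     + y[i];
        y[i + 1] = a * x[i + 1] + y[i + 1];
        y[i + 2] = a * x[i + 2] + y[i + 2];
        y[i + 3] = a * x[i + 3] + y[i + 3];
    }
    for (; i < n; i++)              /* remainder loop */
        y[i] = a * x[i] + y[i];
}
```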

The real point of this example is to illustrate what happens when the compiler does its job properly. While the vectorizer and the aforementioned optimizations will improve the performance of this code, any additional overhead will degrade it. The worst-case scenario is the introduction of additional, dependent load operations in the loop. Any of those (such as accessing an object before accessing a field in it, or an indirect access through a pointer to pointer) will add extra, potentially large latency to the dependency chain. On Sandy Bridge, a mulsd operation has a latency of 5 cycles. A movsd from memory can cost anywhere from 3 cycles (best-case scenario, hitting the L1 cache) to 230-250 cycles (hitting the main memory of the same NUMA node on a dual-socket Sandy Bridge) to 350-370 cycles (hitting the main memory of the other NUMA node) to over 600 cycles (hitting the most remote NUMA node in a quad-socket Sandy Bridge).
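To make the dependent-load scenario concrete, here is a contrived sketch (the function names are illustrative) contrasting a flat array with a pointer-to-pointer layout: in the second version, each iteration must first load rows[i] before it can load the element it points to, adding an extra load, and its potential cache-miss latency, to the dependency chain on every iteration.

```c
#include <stddef.h>

/* Flat layout: the address of x[i] is computed directly from
   the base pointer; one load per element. */
double sum_flat(size_t n, const double *x)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Pointer-to-pointer layout: two dependent loads per element.
   rows[i] must arrive from memory before rows[i][0] can even
   be requested, serializing the latencies. */
double sum_rows(size_t n, double **rows)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += rows[i][0];
    return s;
}
```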

That is why analyzing the assembly code is so important, for it is the binary code that will eventually be executed. One cannot reasonably hope to optimize the performance of a program on a processor without understanding how this processor works.

II.B – General optimization – The basic rule in exploiting a compiler is to start by letting the people who created it do the job. Compilers include a lot of optimizations. Many of them are more or less mandatory because they are cheap (in terms of compilation time) and offer large gains. Some are additional extras that are very often used, as they are usually worth the extra compilation time. Finally, some can be very costly, and can have wildly different effects on performance depending on the specific code involved.

The first job is therefore to choose a compiler. The easiest choice is the machine vendor’s, as it is likely to be the most suited to the target hardware architecture. But third-party compilers are also of interest, precisely because of their strengths and weaknesses. If multiple compilers are available, try them all. It will help you identify bugs and approximations in the source code and, eventually, lead to the selection of one compiler over the others. This, however, will not be a definitive choice: changing parts or all of the code to improve it might help one compiler more than another, to the extent that the former second choice might become the new first choice. Also, be sure to use the latest available version, barring some fatal bug in it.

The optimization options to try first are those prefixed with -O and suffixed with a number (the larger the number, the larger the set of optimizations applied). All codes should work at all optimization levels. If that is not the case, either there is a bug in the compiler or there is a problem with the code. While it is common to blame compilers for failures at high optimization levels, the culprit is generally the code itself. Quite often, the code exercises corner cases or unspecified behaviors of the language, leading to apparently unreliable behavior on the part of the compiler. In this case, the first step is to ensure reproducibility at all general optimization levels, either by fixing the code or by proving there really is a bug in the compiler (and having the compiler’s vendor fix it, of course). Each and every time a new set of options is used, the results of the code should be checked for conformance.

Then, compilers such as Intel’s have inter-procedural optimizations that try to optimize the code across function boundaries rather than within each function independently. In recent versions, these optimizations are active by default inside each compilation unit (i.e. file). The -ipo option enables these optimizations across compilation units. Instead of compiling each file independently (creating “.o” files), the compiler simply creates stubs and then, at link time, compiles everything in one go. The downside is that the entire code has to be recompiled each time anything changes. The upside is that all the information is available to the compiler, including hard-coded values (array bounds, constants…) and call trees. This allows the compiler to do a much better job: for instance, if it is aware of the number of iterations in a loop, it can unroll it more efficiently. If a scalar parameter to a function is known to be a constant, the compiler can use that knowledge. For example, nice constants such as 0 and 1 are identity elements for addition and multiplication respectively; using them as such prevents redundant computations.
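A sketch of the constant-propagation case (the function names are hypothetical): when the callee and its call site are visible together, as they are under -ipo even across files, the compiler can clone scale_add for the constant argument a == 1.0 and drop the multiplication entirely.

```c
#include <stddef.h>

/* y[i] += a * x[i]. If the compiler can see that a call site
   passes a == 1.0 (e.g. across files under -ipo), it can
   specialize this function to y[i] += x[i], removing one
   multiply per iteration. */
static void scale_add(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

void accumulate(size_t n, const double *x, double *y)
{
    scale_add(n, 1.0, x, y);   /* constant 1.0: the multiply is redundant */
}
```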

Another must-try with recent Intel compilers is the set of inlining options. Normally, a compiler will try to inline a certain number of functions but will use heuristics to prevent an excessively large code size and/or compilation time. However, this often limits its ability to merge loops and/or functions, and to eliminate redundant computations and/or memory accesses. A particularly aggressive set of options is

-inline-forceinline -no-inline-factor
-no-inline-max-size -no-inline-min-size

which, combined with -ipo, tells the compiler to merge as much of the code as it can into computational functions, regardless of compilation time, its own memory usage and the size of the resulting binary. Although these options combined can be really aggressive in some contexts, many codes will actually benefit a lot from them. For instance, when applied to the Qiral QCD code [6], the improvement exceeds 2x with the Intel compiler.
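The kind of opportunity forced inlining exposes can be sketched as follows (illustrative functions, not taken from any real code): as long as scale and shift remain separate calls, y is stored by the first loop and reloaded by the second; once both are inlined into the caller, the compiler can fuse the loops into a single pass and eliminate that round trip to memory.

```c
#include <stddef.h>

static void scale(size_t n, double a, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] *= a;
}

static void shift(size_t n, double b, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += b;
}

/* As written, the array is traversed twice. With both calls
   inlined, the compiler can merge the two loops into a single
   y[i] = a * y[i] + b pass, roughly halving the memory traffic. */
void scale_shift(size_t n, double a, double b, double *y)
{
    scale(n, a, y);
    shift(n, b, y);
}
```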

Those are general guidelines for the most common options; each compiler has its own set of advanced settings, some of which can be useful on some codes. As long as the results of the code remain valid, any set of options can be used. Again, if “the compiler broke the code”, then there is a problem somewhere that must be tracked down. Otherwise, the problem may reappear with a different compiler, a different machine or, worse, a different data set.


[5] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach. San Mateo, CA: Morgan Kaufmann, 1990.



© HPC Today 2017 - All rights reserved.
