I.B – Measuring performance – While measuring performance seems at first glance quite simple, it is actually a fairly complex subject as well. What should you measure? And how should you measure it? When you’ve come up with reasonable answers to these two questions, there remains the difficulty of ensuring the consistency of measurement from one run to another.
What to measure is logically the first decision make. Eventually, something akin to the time command is going to be used: when the code is run in production, it is the entire execution time from beginning to end that is going to matter to the consumers of the code. It’s also going to matter to the programmer, as it is the time taken by any validation run or full-scale measurement. If this metric is obviously important, others are, too. After a profiling has been done (see section I.C below), the code will be seen as a sequence of stages: the prolog or initialization phase (usually an O(problem size)), one or more computational steps of varying complexity, and the epilog (also usually an O(problem size)). The costliest of these three steps will usually be the computational part, but it is not always the case. Prologs and epilogs generally consists of I/O and data movements, which are outside the scope of this article. If they dominate the execution time, either the code is not amenable to improvements, or the test case is too small for a significant measurement.
Gustafson’s law  helps us here, as for most codes there should be a problem size big enough that the computation part dominates the execution time. Gustafson’s aw is an important consideration: even if the test case used during the optimization phase doesn’t offer an overall improvement (because of significant prolog and epilog times combined with Amdhal’s Law), larger production cases might since the fraction of time spent in the computational phase will be higher. It is therefore important to keep track not only of the overall execution time, but also of each main phases of the code. An X2 improvement in a computation phase representing 50% of the test case runtime is an overall X1.33 improvement of the whole code, but an X2 improvement in a computation phase representing 98% of the production case runtime is an overall X1.96 improvement.
How to measure it is the second step. The problem here is that time scales on computers can vary by orders of magnitudes. On a 2.5 GHz CPU, a single CPU cycle takes 0.4 ns, or 4×10^-10 s.
Execution times that go beyond the hour take on the order of 10^14 cycles, of which only the most significant digits make sense. Conversely, a small function of a few thousand cycles would take on the order of 10^-7 s, a precision greater than most system calls would ensure in practice. It is a good idea to keep these orders of magnitudes in mind when choosing the proper timer. For most human-scale measurements (tenths of seconds or more), the gettimeofday function is perfectly adequate. For hundreds of microseconds or less (millions of cycles or less) measurements, use _rdtsc (in x86-64 compilers) to get the current CPU cycle count. For time scales in between, things are less clear-cut: gettimeofday having a theoretical precision of a microsecond, it should be adequate. But since _rdtsc returns a 64 bits integer, measuring billions of cycles is also an option.
Consistency of measurement might be the most overlooked aspect of the problem. Codes seldom run on their own on a computer; they operate in an environment that includes the operating system, various housekeeping programs running permanently or regularly, and potentially other users. Whenever possible, reliable measurements should be made on an otherwise idle machine – one where very little is running on top of the operating system. Any other code will influence the results, by consuming memory bandwidth, inter-cpu bandwidth, cache space, I/0 bandwidth, or even by simply raising the power consumption of a CPU and affecting the thermal behavior of the others.
That is not all; the operating system behavior can drastically affect the performance of the CPU. Modern machines have multiple cores, each of them with its private L1 cache (and often private L2 caches). Whether or not the operating system let the code run on the same core for the entire run will greatly affect performance, as each move from one core to another will force the code to reload the local caches. Moving from one socket to another has an even greater effect: not only the shared intra-socket cache (L3 on the Sandy Bridge architecture) will be reloaded, but NUMA effects will happen as well. To keep that under control, the Linux operating system exposes commands to pin or lock a process on a subset of cores, to specify which NUMA memory domains to use, and so forth (see for instance numactl). Understanding the exact topology of the machine is also advisable, using tools from e.g. the hwloc library. When a code has already been parallelized, many tools have the ability to automatically enforce the pinning: the KMP_AFFINITY environment variable in the Intel OpenMP implementation, the GOMP_CPU_AFFINITY environment variable in the GNU OpenMP implementation, the -binding option in the Intel MPI library, and so on. The golden rule here is that ensuring a consistent placement of processes and threads in the machine ensures reproducible measurements.
I.C – Profiling – Profiling is the process of evaluating the relative cost of each part of the code. Without proper profiling, there is no way to tell which part of the code should be improved. No matter how convincing is the argument that function XYZ is the most important in the code, only an objective assessment should be trusted. Intuition, past experience, alternative implementations and test on completely different hardware platforms do not represent the current status of the code on the currently available hardware.
There are two main classes of profiling: instrumentation and sampling. Instrumentation adds instructions inside the binary to dynamically measure the relative use of each function. While comparatively reliable for creating call trees (how each function calls each other) and call counts, it has the bad property of adding overhead to the code and potentially altering its behavior. Sampling takes an unmodified code, and checks its running at regular intervals to evaluate the relative importance of each part. While much less intrusive, it can sometimes miss short but important parts of the code, or have weird side-effect due to the sampling interval. Neither is strictly better than the other, so both types should be used. Tools like gprof, Intel VTune or callgrind are all useful, to the extent that all available tools should be used, not just one of them. Each has its strengths and weaknesses, and it is the combination of their results that gives a clear picture of the code’s dynamic behavior.
In an ideal world, profiling should be done on nearly production-ready code, i.e. a highly optimized binary. But more often than not, optimizations obfuscate the code to the point that profiling results are not very helpful. In difficult contexts, optimizations (and inlining in particular) should be “dialed down” step by step, and the consistency of profiling results checked accordingly until meaningful numbers are obtained. For instance, if the original results indicate that the do_the_computations function does 95% of the work, then subsequent results with less inlining should show the same order of magnitude for the sum of the time periods spent in do_the_computations and its call tree (all the functions it calls, and all the functions the latter call, and so on). In case of inconsistent results, you must absolutely understand why, and identify which optimization changed the code’s behavior so drastically that the profiling results change. It might become important later to understand how to improve performance.
Profiling shouldn’t be viewed as a one-shot operation. Once the costlier parts have been identified and improved, profile your code all over again to get a new, up-to-date picture. Improvements in one part will make the other parts comparatively more important, but side-effect such as new cache behavior may alter their performance, for better of for worse.
 J. L. Gustafson, Reevaluating Amdahl’s law in Commun. ACM, vol. 31, no. 5, pp. 532–533, May 1988.
© HPC Today 2019 - All rights reserved.
Thank you for reading HPC Today.