While it is always difficult to optimize existing codes, an appropriate methodology can provide provable and replicable results. In this first article in a series of two, we’ll begin by evaluating and validating sources before addressing the critical issue of the efficient use of the compiler. Next month, we’ll focus on the organization of data, the algorithmic implementation and the optimization of parallelization.
Romain Dolbeau – HPC Fellow, CAPS Entreprise.
The subject of performance programming is a complex one, about which many scholarly papers have been published. Deriving from classic introductions such as Chellappa et al. , this article aims at providing an experience-based approach to the newcomers. Its target audience includes all those interested in writing high-performing applications, whether their background includes computer science or not. On the basis of existing codes, we will detail a path leading to performance improvements with minimal effort and in a minimal amount of time. We’ll also cover good practices to ensure the correctness and efficiency of the work done.
Performance is a difficult thing to obtain in programming. Not just because writing fast code is complex, but also because understanding why the code isn’t fast in the first place is often far from obvious. When tasked with improving the performance of a code, developers find themselves with a well-known multi-step agenda that has to be iterated upon:
1) Identify the sections of the code that take up most of the time;
2) Analyze why these sections are so costly;
3) Improve their performance by targeting bottlenecks.
But this conventional methodology, based on Amdalh’s law , is only a very succinct explanation – or a high-level view – of the actual processes involved. It does not express the low-level concerns of the everyday programmer. For performance isn’t just about the code; it is about the entire sequence of decisions and elements that leads from the code to the actual computations, i.e.
1) The chosen algorithms;
2) The specific implementation of these algorithms in the chosen programming language;
3) The organization of the data on which the algorithms will work;
4) The choice of an optimizing compiler to convert the code from source to binary;
5) The third-party libraries on which the code relies;
6) The actual hardware on which the code will run.
Each of these decisions and elements must be taken into account to achieve the goal of optimized performance; ignoring any one can lead to a significant waste of efficiency. And while the process of creating and running codes is sequential through or within each elementary step, it is the entire sequence that has to be targeted at once. Optimizing one step for a given set of states of the other steps does not mean it will remain optimal once these other steps have changed.
This paper suggests a very pragmatic approach built on years of experience in optimizing code – an approach that tries to maximize the efficiency of the programmer’s time. Rather than rewriting the code straight away, a better idea is to start by leveraging all the tools available to maximize its efficiency, and then only to consider modifying it. In this respect, the important points this paper will try to make are the following:
– Reproducibility: we want to be able to reproduce both the results of the execution of the code (fast but wrong is not a good thing), and the measured performance. This is an issue for programs whose performance is data-sensitive, such as convergence algorithms, as several data sets will have to be validated.
– Maintainability: even if ultimate performance is the objective, the code will have to be maintained in one way or another. As obvious as it sounds, from two implementations with similar performance, the easier to explain and maintain is usually the better choice.
– Prioritization: time should be invested where the best results are likely to be obtained. Manually fine-tuning a kernel is fun but very often it is also an inefficient use of time. Once a code section is fast enough for its purpose, there is no point in making it faster.
– Performance: with everything else above in mind, performance becomes a much more manageable goal.
I – EVALUATION AND VALIDATION
I.A – Test cases – The single most important thing when working on a code is to maintain the validity of the results. The second most important thing is to reliably quantify the gain (or loss) of performance of the modified, validated version.
Validating results is always a difficult task. While some codes will have the good property of producing output files whose content depends solely on the input parameters, it is not always the case. Any code involving IEEE754-2008 floating-point operations will suffer from rounding approximations, unless compiled with extremely strict conformance options that prevent any performance gain. Any code involving random numbers (such as Monte Carlo methods) will produce unreproducible results. And not all algorithms are equal with regards to numerical accuracy, a subject covered in details by Higham .
Accordingly, the first step is to establish a test case (or a set of test cases) that will have several good properties for validating the results:
– An execution time short enough that it can be tested on a regular basis;
– An execution time long enough that small mistakes are likely to become noticeable in the output, rather than being masked by numerical approximations;
– A good coverage of the significant portions of the computations, both in terms of amount of code and type of input data (e.g. when propagating energy in a discretized space, the first few steps are usually full of zeroes, while boundary conditions are not reached and thus not tested until many iterations have been run);
– A clear validating procedure to ensure the output is correct.
These aspects should be discussed with the people interested in running the code for production. The second and third point, in particular, are not always easy to achieve. For codes involving random numbers, test cases should be de-randomized, for instance by fixing a seed or by precomputing one or more set of values, as it is usually the easiest way to ensure reproducibility.
The second step is to establish another test case (or another set of test cases), whose purpose will be to check performance. This test case will usually be larger (and longer running) than the validating case. It should also come with a clear validating procedure, in order to trap any residual mistake that might not be noticeable with the smaller test case. And it should be reasonably representative of typical production inputs. This can be very hard to achieve for extremely long and/or extremely big codes. Unfortunately, there is no shortcut here, as representativeness is highly dependent on the type of code.
Finally, never forget that it does not make much sense to try and improve a code with remaining issues. Verifying that the code doesn’t misbehave is therefore an important preliminary step. Compilers can optionally do runtime bounds checking. Tools like Valgrind will check for accesses that haven’t been initialized or that reach beyond properly allocated memory. Any such issue in the code must be dealt with prior to any kind of optimization as it is bound, sooner or later, to result in unreliable behaviors.
 S. Chellappa, F. Franchetti, and M. Püschel, How to write fast numerical code: A small introduction in Summer School on Generative and Transformational Techniques in Software Engineering (GTTSE), ser. Lecture Notes in Computer Science, vol. 5235. Springer, 2008, pp. 196–259.
 G. M. Amdahl, Validity of the single processor approach to achieving large scale computing capabilities in Proceedings of the April 18-20, 1967, spring joint computer conference, ser. AFIPS ’67 (Spring). New York, NY, USA: ACM, 1967, pp. 483–485.
© HPC Today 2020 - All rights reserved.
Thank you for reading HPC Today.