I – Technical issues for parallel codes
Implementing an HPC code aims at getting as close as possible to the machine's peak performance. Table 2 gives a set of efficiency figures achieved with the Linpack benchmark – a highly favorable case. The diversity of these ratios results from a series of tradeoffs between architecture choices and investments in software development. Today's heterogeneous architectures (e.g. CPU+GPU) further expand the design choices. As a consequence, moving to a new system requires an increasingly integrated approach to achieve efficient execution.
| Computer | Top500 Linpack Efficiency |
|---|---|
| K Computer (Sparc) | 93.17% |
In practice, the development of parallel codes (legacy or new) requires taking many parameters into account in order to produce an efficient parallel execution capable of surviving several generations of systems. Whereas the typical lifespan of a supercomputer is 5 years, large codes typically stay in production for 20 years or more. In other words, applications have to “survive” at least 4 very different machines (which implies multiple OSes, multiple compilers, etc.). This estimate is in fact a lower bound, as many important codes are also used on systems external to the development team. This point must drive the choice of development standards, programming environments, code structures, etc., but these choices are difficult to make and have critical consequences. History shows that successive adaptations to parallel technologies have been painful and expensive (e.g. Fortran+Vector to Fortran+OpenMP to Fortran+MPI). The complexity of the issue has its roots in the impact of computing technology on algorithms and numerical methods (see Section II). In this regard, it is worth remembering that the cost of a final line of code averages between 10 and 100 euros, depending on the complexity of the algorithm being implemented and the counting method [9, 10].
For instance, at the 32nd ORAP Forum, Jack Wells from ORNL indicated that 1–2 person-years were required to port each code from Jaguar (2.3 PF, 2010) to Titan (27 PF, 2012). An estimated 70–80% of developer time is spent on code restructuring, regardless of whether OpenMP, CUDA, OpenCL or OpenACC is used. Each code team must make its own choices among these many languages and paradigms, and the specificities of each case may lead to different conclusions for each code.
Everyone also agrees that the era of homogeneous programming is coming to an end. The vast majority of current systems mix multiple architectural techniques to achieve high performance: vector instructions, multiple cores, simultaneous multithreading (SMT) and/or accelerators. Each of these challenges not only the coding but also the numerical methods, which must exhibit all kinds of parallelism to achieve high efficiency (Macro ∼ MPI, Meso ∼ thread, Micro ∼ vector-data parallel). For instance, not using vector instructions (e.g. AVX) intrinsically limits achievable performance, typically wasting 50% to 85% of computing resources depending on the system. This effect is not expected to be alleviated by future processors, if only because vector registers tend to keep growing. Heterogeneity equally impacts numerical libraries, whose use is currently not transparent since they are optimized for specific hardware (e.g. cuBLAS vs MKL). Sophisticated runtime techniques are also necessary to adapt to the configuration of the machine. Depending on how memory and computing resources are allocated (for instance under energy constraints), applications need to adjust their behavior regularly to fit the machine arrangement. Runtime API standardization therefore remains another open issue.
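As a rough back-of-envelope illustration (our own, not from the source): if a core's peak throughput requires full-width SIMD, purely scalar code is limited to one lane out of the vector width, and the 50% to 85% range quoted above corresponds to widths between 2 and roughly 8 lanes. The function name below is hypothetical:

```python
def wasted_fraction(vector_width):
    """Upper bound on the fraction of peak FLOP/s left unused by purely
    scalar code, assuming peak requires full-width SIMD on this core."""
    return 1.0 - 1.0 / vector_width

# 128-bit SSE on doubles (width 2): 50% of peak wasted
# 512-bit AVX-512 on doubles (width 8): 87.5% of peak wasted
```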
At the core of all code development choices is the organization of data structures. This is a major design or restructuring problem, which must be solved holistically, taking into account data locality, efficient asynchronous execution, pre- and post-processing, checkpoint/restart, etc. To complicate things further, these issues must factor in the reality that moving data around systems will clearly become more and more expensive (time- and energy-wise). Although frequently overlooked, the I/O sub-system is indeed becoming a major platform cost. Over time, this change of balance will strongly influence how pre- and post-processing are currently organized around the core computation. Limiting data traffic is therefore crucial. In situ and in transit analysis can help with this, as can application-generated hints about how the data are structured. This is actually another area where Big Data and HPC could combine.
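A classic instance of such a layout choice is Array-of-Structures versus Structure-of-Arrays. The sketch below (illustrative Python with hypothetical field names, standing in for the C/Fortran layouts it mimics) shows why a kernel that needs only one field favors SoA: it streams through a single contiguous, unit-stride array instead of skipping over whole records, which is what makes the loop cache- and vector-friendly on real hardware:

```python
# Array-of-Structures: one record per particle (fields interleaved in memory)
particles_aos = [{"x": float(i), "y": 2.0 * i, "z": 0.0} for i in range(4)]

# Structure-of-Arrays: one contiguous array per field
particles_soa = {
    "x": [float(i) for i in range(4)],
    "y": [2.0 * i for i in range(4)],
    "z": [0.0] * 4,
}

def shift_x_aos(parts, dx):
    # must walk every full record even though only "x" is needed
    return [p["x"] + dx for p in parts]

def shift_x_soa(soa, dx):
    # streams through one contiguous array: unit stride, vectorizable
    return [x + dx for x in soa["x"]]
```

Both kernels compute the same result; only the memory traffic pattern differs, which is exactly the property that matters as data movement grows more expensive.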
The process of application development is also being questioned by the current evolution of practices. The V-model cycle has shown its limits. From a code engineering point of view, scientific applications have few specific traits apart from a strong need for code optimization. In many cases, agile processes [19, 20] and the corresponding tools (e.g. version management) could improve development efficiency within the HPC community. This is particularly true for Exascale applications, where a co-design approach looks necessary.
Redesigning codes and their corresponding data lifecycle (pre-/post-processing, visualization…) requires a larger set of skills than is usually gathered in HPC projects. The complexity and scope of the changes needed to take massive parallelism, I/O balance, etc. into account can only be fully addressed by bringing different scientific communities together in the same place (physical or virtual). This is not a minor organizational challenge. The approach must be adapted to each application development ecosystem, and these vary greatly: an open-source distributed scientific community cannot be organized in the same way as a single large organization located on one site.
II – Algorithmic considerations
In a number of cases, achieving scalability requires going back to basics, i.e. the physics, instead of forcefully trying to adapt legacy codes. The tradeoffs made over the last decades to reduce the amount of computation have frequently introduced complexity that conflicts with efficient parallelization. These tradeoffs may be reviewed in the light of the decreasing cost of FLOPs versus the sharply increasing cost of Bytes/s. Physics, numerical methods and parallelization are strongly intertwined and must be reconsidered together, with parallelization especially constraining what can be simulated.
One should also keep in mind that performance increases do not come only from hardware: over the past 25 years, algorithmic progress has in some cases afforded at least as much gain as machine evolution. Globally, the impact of technology improvements on algorithmic choices should not be underestimated. This is true of many of its aspects, especially regarding the balance between numerical quality and algorithm scalability (implicit vs explicit methods with a small time step, acyclic graphs…). The choice of discretization also has consequences for the whole software architecture (one example being the use of evolving meshes to improve the regularity of computations). More subtle methods may not scale easily (Lagrangian methods with a fixed number of DOFs vs ALE with a variable number of DOFs, Euler vs AMR), but that does not mean they should be forsaken altogether.
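The implicit-vs-explicit tradeoff mentioned above can be made concrete on the scalar decay equation du/dt = -λu, a minimal sketch of our own choosing: forward (explicit) Euler is cheap and embarrassingly parallel per step but stable only when λ·dt ≤ 2, whereas backward (implicit) Euler is unconditionally stable at the price of solving a system at each step (trivial here, but in general a global, communication-heavy solve):

```python
def explicit_decay(lam, dt, steps, u0=1.0):
    # forward Euler for du/dt = -lam*u: u_{n+1} = u_n * (1 - lam*dt)
    # the amplification factor |1 - lam*dt| exceeds 1 when lam*dt > 2
    u = u0
    for _ in range(steps):
        u = u * (1.0 - lam * dt)
    return u

def implicit_decay(lam, dt, steps, u0=1.0):
    # backward Euler: u_{n+1} = u_n / (1 + lam*dt)
    # amplification factor is always below 1 for lam, dt > 0
    u = u0
    for _ in range(steps):
        u = u / (1.0 + lam * dt)
    return u
```

With λ = 10 and dt = 0.3 (so λ·dt = 3), the explicit iterate doubles in magnitude every step while the implicit one decays, illustrating why the "better" numerical scheme and the "more scalable" one are not always the same.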
The path towards Exascale is without a doubt a new gold rush for computational scientists. Specifically, research on new communication-avoiding algorithms is a track that should be pursued and amplified [23, 24, 25].
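The flavor of communication avoidance can be illustrated on the simplest collective, a global sum: gathering P partial results one by one at a root costs P-1 communication steps, while pairwise (tree) combination needs only about log2(P) rounds; restructuring an algorithm to expose such combining is the idea behind schemes like CALU [25]. The toy model below is ours, not code from the source:

```python
def linear_reduce(values):
    """Root accumulates one partial result per step: P-1 rounds."""
    total, rounds = values[0], 0
    for v in values[1:]:
        total += v
        rounds += 1
    return total, rounds

def tree_reduce(values):
    """Pairwise combination: ceil(log2(P)) rounds of concurrent messages."""
    vals, rounds = list(values), 0
    while len(vals) > 1:
        vals = [vals[i] + vals[i + 1] if i + 1 < len(vals) else vals[i]
                for i in range(0, len(vals), 2)]
        rounds += 1
    return vals[0], rounds
```

For 8 partial sums, the linear scheme takes 7 rounds and the tree scheme 3; the gap widens rapidly with the number of processes.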
S. McConnell. Software Estimation: Demystifying the Black Art. Microsoft Press, Redmond, WA, USA, 2006.
NVIDIA. CUBLAS Library User Guide. NVIDIA, v5.0 edition, Oct. 2012.
EESI2 – Xyratex working document – WG 5.1, 2013.
F. P. Brooks, Jr. The Mythical Man-Month (Anniversary Ed.). Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1995.
K. Schwaber and M. Beedle. Agile Software Development with Scrum. Prentice Hall PTR, Upper Saddle River, NJ, USA, 1st edition, 2001.
D. A. Dewi, E. Sundararajan, and A. S. Prabuwono. The adaptive agile in high performance computing: Case study on parallel image feature extraction. Sci. Int. (Lahore), pp. 1045–1052, 2013.
D. Keyes, P. Colella, T. H. Dunning, and W. D. Gropp. A Science-Based Case for Large-Scale Simulation, volume 2, Sept. 2004. DRAFT, Office of Science, U.S. Department of Energy.
L. Grigori, J. Demmel, and H. Xiang. CALU: A communication optimal LU factorization algorithm. SIAM J. Matrix Analysis and Applications, 32(4):1317–1350, 2011.