Processor evolution: what to prepare application codes for?

By François Bodin, Henri Calandra and Alain Refloc'h | April 28, 2014

This paper presents a synthesis of the ORAP workshop held in December 2013 and hosted by Total at Paris – La Défense. The workshop aimed to identify the impact of the evolution of processor technology on application codes. One of its goals was to identify best practices that may help future code development in the context of high performance computing.

• François Bodin (IRISA)
• Henri Calandra (Total)
• Alain Refloc’h (ONERA)

The following people contributed to the content of this document:
Jean-Michel Alimi (Observatoire de Paris), Guillaume Colin de Verdière (CEA DAM), François Courteille (NVIDIA), Quang Dinh (Dassault Aviation), Romain Dolbeau (CAPS Entreprise), Yvan Fournier (EDF), Laura Grigori (INRIA Rocquencourt), Thomas Guignon (IFPEN), Michel Kern (INRIA), Yann Meurdesoif (CEA), Raymond Namyst (INRIA Bordeaux), Serge Petiton (Université de Lille 1, Sciences et Technologies), Philippe Ricoux (Total), Philippe Thierry (Intel).

The Power Wall [1, 2], which refers to the electrical energy consumption of a chip as the limiting factor for increasing processor frequency, has led to the design of new parallel multi-core and many-core architectures. In this context, feeding the parallel computing units efficiently must be seen as a key consideration when evolving legacy codes and as a central issue when developing new ones. While new architectures offer instructions that operate on wider vectors, memory bandwidth remains unable to serve the computing units at a sufficient rate. We are now very far from the 24 Bytes per Flop of the Cray 1. Data locality has been a concern for years and is becoming even more critical with scaling up in mind.

Table 1 lists typical system characteristics and their relationship to Exascale objectives as an illustration of the trend in future HPC machines. While Exascale is an extreme evolution scenario, it shows the scale of parallelism that codes will have to exploit to remain efficient over the next 10 to 20 years. A few specific cases will need Exascale runs, which is commonly referred to as “capability computing”. However, it is expected that in most situations the efficient use of resources will be achieved via high job throughput, also known as “capacity computing”. In all cases, application codes will have to exhibit massive parallelism at the node level.

Table 1. Typical system characteristics and their relationship to Exascale objectives.

	CPU	Intel Xeon Phi	GPU
Frequency (GHz)	1 – 3.2	1	0.7 – 1.5
Number of cores per processor	8 – 32	61	500 – 900
Number of nodes (thousands)	10 – 100+	1 – 50	1 – 20
Peak Pflops	20.1 – Sequoia	48 (+6.9) – Tianhe2	24.5 (+2.6) – Titan
Linpack Pflops	17.2	33.9	17.6
Factor to reach Exaflop (10¹⁸)	x58	x29.5	x56.8
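
The last row of Table 1 follows directly from the Linpack row: it is the ratio of one Exaflop (10¹⁸ flop/s, that is 1000 Pflops) to the measured Linpack performance. The arithmetic can be sketched in a few lines of Python. The function name `factor_to_exaflop` and the dictionary layout are illustrative choices, not part of the workshop material; only the Linpack figures come from the table.

```python
# Linpack performance in Pflops, taken from Table 1.
LINPACK_PFLOPS = {
    "Sequoia (CPU)": 17.2,
    "Tianhe2 (Intel Xeon Phi)": 33.9,
    "Titan (GPU)": 17.6,
}

EXAFLOP_IN_PFLOPS = 1000.0  # 10^18 flop/s = 1000 x 10^15 flop/s


def factor_to_exaflop(linpack_pflops):
    """Return how many times faster a system must run to sustain one Exaflop."""
    return EXAFLOP_IN_PFLOPS / linpack_pflops


for system, pflops in LINPACK_PFLOPS.items():
    print(f"{system}: x{factor_to_exaflop(pflops):.1f}")
```

After rounding, the three ratios agree with the factors listed in the table.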

The evolution of processors deeply impacts code development and numerical methods with respect to scaling [3, 4]. Many parallel codes are implemented with MPI, for which one process per CPU core has been the best tradeoff for years. But this strategy is reaching its end because it cannot efficiently exploit current memory systems, which explains why mixing OpenMP and MPI [5, 6] is becoming more and more popular.
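
The scaling concern can be made concrete with the two laws cited in [3, 4]. The following is a minimal sketch, not taken from the workshop material: the function names and the 99% parallel fraction are assumptions chosen for illustration, while the unit counts 61 and 900 are the per-processor core counts from Table 1. Amdahl's law models a fixed problem size; Gustafson's law models a problem that grows with the number of computing units.

```python
def amdahl_speedup(parallel_fraction, n_units):
    """Fixed-size speedup: the serial part is not accelerated (Amdahl [3])."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_units)


def gustafson_speedup(parallel_fraction, n_units):
    """Scaled speedup: the parallel part grows with n_units (Gustafson [4])."""
    serial_fraction = 1.0 - parallel_fraction
    return serial_fraction + parallel_fraction * n_units


# Hypothetical code that spends 99% of its time in parallelizable work.
PARALLEL_FRACTION = 0.99

for n_units in (61, 900):  # per-processor core counts from Table 1
    fixed = amdahl_speedup(PARALLEL_FRACTION, n_units)
    scaled = gustafson_speedup(PARALLEL_FRACTION, n_units)
    print(f"{n_units} units: Amdahl x{fixed:.1f}, Gustafson x{scaled:.1f}")
```

Under the fixed-size model, the speedup can never exceed 1 / (1 - parallel fraction), that is 100 for the assumed 99%, no matter how many cores are added. This is one way to see why even a small serial or poorly fed portion of a code matters at the core counts listed in Table 1.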

Updating legacy codes in this context is a particularly difficult challenge: they frequently contain more than one million lines accumulated over decades and mix strata of languages (F77, F90, C, C++…). What’s more, many codes were primarily designed for serial execution, with the objective of reducing the operation count.

In an old house, replacing one wall tile may end with the entire wall torn down. This metaphor aptly illustrates many current HPC situations where code modularity has been compromised by the successive addition of features. Legacy codes may also suffer from complex validation issues that slow down the profound changes required to achieve efficient parallel execution. In addition, moving codes toward parallel computing raises important and orthogonal topics such as software engineering techniques. Finally, the lack of visibility on forthcoming programming standards and processor architectures considerably hampers the planning of code evolution. A good summary of this context was given by Bill Harrod from the Department of Energy (DOE) at the recent workshop on Big Data and Extreme-scale Computing (BDEC) [7]: “The world has changed and uncertainty threatens the future of computing. Technology is changing at a dramatic rate. The IT marketplace is also changing dramatically with PC sales flattening and handhelds dominating growth. Not to mention HPC vendor uncertainty and data volume and variety explosion…”

It should be noted that the first factor driving an application is its ecosystem and its deployment constraints. It is always easier to justify new user-oriented features than a technology evolution. However, a lack of anticipation may drive many codes to their graves.

We do recognize that an application code is only one component of a global computing infrastructure that also requires I/O, visualization, sensors, etc. This document is accordingly a collective attempt to list the consequences of the evolution of computing technology for the development of scientific applications. It has deliberately been kept short and has no ambition other than to provide a set of pointers toward the right development strategy to adopt. It is organized in three main parts: Section I lists technical issues to be taken into account while developing parallel codes; Section II briefly addresses algorithms and numerical issues; Section III surveys current best practices.

[References]

[1] S. H. Fuller and L. I. Millett. Computing performance: Game over or next level? Computer, 44(1):31–38, 2011.

[2] H. Sutter. The free lunch is over: A fundamental turn toward concurrency in software. Dr. Dobb's Journal, 2009.

[3] G. M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, Spring Joint Computer Conference, AFIPS ’67 (Spring), pp 483–485, New York, NY, USA, 1967. ACM.

[4] J. L. Gustafson. Reevaluating Amdahl’s law. Commun. ACM, 31(5):532–533, May 1988.

[5] P.-F. Lavallée. Programmation hybride MPI-OpenMP.

[6] J. Zollweg. Hybrid programming with OpenMP and MPI.

[7] B. Harrod. Big data and scientific discovery. Presentation at BDEC, Fukuoka, Japan, 2014.