Processor evolution: what to prepare application codes for?

By François Bodin, Henri Calandra and Alain Refloc'h | April 28, 2014

III – Recommendations and best practices

This section gives an overview of the best practices identified during the workshop. It is important to note that these practices must be adapted to each context and ecosystem. One of the lessons of the workshop has been to highlight a number of application development processes, ranging from large, spatially distributed scientific communities to integrated project teams.

Validation – Validation must be implemented as a continuous process, as recommended by software engineering techniques (and implemented, for instance, by frameworks such as Hudson [26]). It must ensure reproducibility of results without requiring bit-for-bit comparison, since enforcing bit-identical outputs is usually extremely detrimental to parallel execution efficiency. Consistency and verification can also be efficiently performed using appropriate visualization of the computation. This issue is a prime consideration when migrating or designing a code.
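As an illustration of validating without bit-for-bit comparison, the following Python sketch sums the same values in two different orders, as a serial loop and a chunked parallel reduction might, and accepts the result if it agrees within a relative tolerance. The function names and the tolerance of 1e-9 are assumptions made for this example; an appropriate tolerance depends on the application.

```python
import math


def sum_in_chunks(values, n_chunks):
    """Sum values as a parallel reduction might: partial sums per chunk, then combine."""
    chunk = max(1, math.ceil(len(values) / n_chunks))
    partials = [sum(values[i:i + chunk]) for i in range(0, len(values), chunk)]
    return sum(partials)


def results_match(reference, candidate, rel_tol=1e-9):
    """Accept a result that agrees with the reference within a relative tolerance.

    The tolerance is an illustrative assumption, not a recommended value.
    """
    return math.isclose(reference, candidate, rel_tol=rel_tol)


# Hypothetical data: the two summation orders may differ in the last bits.
values = [0.1 * (i % 7) + 1e-9 * i for i in range(100_000)]
serial = sum(values)
parallel_like = sum_in_chunks(values, n_chunks=8)

validated = results_match(serial, parallel_like)
```

A check of this kind can run in a continuous integration job against stored reference values, so that a change in decomposition or thread count does not trigger false failures.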

Data Locality – Data locality and volume are critical concerns. The internal data structures must be designed so as to facilitate adaptation to different target computer architectures. For instance, it is desirable to allow arrays of structures to become structures of arrays (and vice versa) depending on the compute node organization. Serializable data structures are also preferable since at some point they may have to migrate to a different storage unit (e.g. accelerator memory, NVM…). C++ templates can be one implementation technique, but this topic is controversial due to the added complexity in code maintenance.
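The array-of-structures versus structure-of-arrays distinction can be sketched as follows. This is a minimal Python illustration with hypothetical particle fields (x, y, mass); a production code would more likely do this in C++ or FORTRAN with contiguous arrays, but the transformation is the same.

```python
def aos_to_soa(records):
    """Convert an array of structures (list of dicts) to a structure of arrays (dict of lists)."""
    if not records:
        return {}
    return {field: [record[field] for record in records] for field in records[0]}


def soa_to_aos(fields):
    """Convert a structure of arrays (dict of lists) back to an array of structures."""
    names = list(fields)
    return [dict(zip(names, row)) for row in zip(*(fields[name] for name in names))]


# Hypothetical particle data in array-of-structures form.
aos = [
    {"x": 0.0, "y": 1.0, "mass": 2.0},
    {"x": 3.0, "y": 4.0, "mass": 5.0},
]

soa = aos_to_soa(aos)         # all "x" values are now stored together
round_trip = soa_to_aos(soa)  # and the original layout can be recovered
```

Keeping both conversions behind a small interface lets the rest of the code stay independent of the layout chosen for a given compute node.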

Scalability – Collective operations (e.g. reductions) do not scale well; consequently, they should be avoided whenever possible in favor of neighbor data exchanges. The choice of a solver is crucial here and must be carefully considered. This choice should preferably be made globally when designing the numerical scheme and data structures.
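To make the idea of neighbor data exchange concrete, here is a small Python sketch of a 1D Jacobi smoothing step on a domain split into subdomains. Each subdomain only needs one boundary value from its left and right neighbors (a halo), and no global operation is involved. The exchange is simulated with plain lists; in a distributed code each halo value would typically be a point-to-point message. All function names are hypothetical.

```python
def jacobi_step(u):
    """One Jacobi smoothing step on a 1D list, keeping the two end values fixed."""
    interior = [0.5 * (u[i - 1] + u[i + 1]) for i in range(1, len(u) - 1)]
    return [u[0]] + interior + [u[-1]]


def split(u, n_parts):
    """Split u into n_parts contiguous subdomains (len(u) is assumed divisible by n_parts)."""
    size = len(u) // n_parts
    return [u[i * size:(i + 1) * size] for i in range(n_parts)]


def step_with_halo_exchange(parts):
    """Advance every subdomain using only the boundary values of its direct neighbors."""
    new_parts = []
    for rank, local in enumerate(parts):
        left = parts[rank - 1][-1] if rank > 0 else None
        right = parts[rank + 1][0] if rank < len(parts) - 1 else None
        padded = local[:]
        if left is not None:
            padded = [left] + padded
        if right is not None:
            padded = padded + [right]
        updated = jacobi_step(padded)
        start = 1 if left is not None else 0
        end = len(updated) - (1 if right is not None else 0)
        new_parts.append(updated[start:end])  # drop the halo cells again
    return new_parts


u = [float((i * i) % 7) for i in range(12)]
parts = step_with_halo_exchange(split(u, 3))
flattened = [x for part in parts for x in part]
```

The decomposed step reproduces the global one while each subdomain communicates with at most two neighbors, regardless of the total number of subdomains.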

Programming Languages – The choice of a programming language must first be made according to programmers’ background. In HPC contexts, only C, C++ and FORTRAN (which, contrary to many beliefs, is still a pertinent option) are universally supported. Newer languages look promising (e.g. PGAS languages) but still lack critical mass as well as long-term visibility. Engineering considerations (e.g. portability, persistence, efficiency, libraries…) should be the ones guiding the choices in this matter.

In some cases, the best implementation relies on global frameworks such as Arcane [27], which integrate computing parts written in either C++ or FORTRAN. However, combining languages in a given application requires careful thinking, all the more so as efficient interoperability calls for the design of internal interfaces that need to be stable over time. Mixing skills (e.g. computer science and numerical analysis) is one of the main conditions for the use of multiple languages. However, the number of different languages must be kept as limited as possible to prevent issues related to huge builds.

Domain Specific Approaches – Domain specific approaches [28] are being increasingly considered as they provide a level of abstraction appealing to scientists. They also allow the use of sophisticated underlying implementations, generally based on finely tuned, specific runtimes and compute libraries. These approaches can have the benefit of bridging communities, but they may result in process or methodology disruptions.

At this point, embedded DSLs (Domain Specific Languages) seem to be an attractive tradeoff. This technique consists of embedding a DSL in a general-purpose host language like FORTRAN, the former providing extra semantic information and/or a code generation strategy. The advantage of this approach is that the reference code remains in a standard language. Directive-based languages [29, 30] belong to this category. Some implementations are based on C++ expression templates [31, 32]. However, in some cases, this alone is not enough to provide the necessary high-level algorithmic abstraction, mainly because the data model remains that of the host language.
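The idea behind expression-template style embedded DSLs can be sketched in a few lines. The example below is written in Python rather than C++ and is only an analogy: operator overloading builds an expression tree instead of computing temporaries, and a single fused loop evaluates it at the end. The class names Expr, Vec and BinOp are hypothetical and do not correspond to any of the cited libraries.

```python
class Expr:
    """Base class: arithmetic operators build an expression tree, they do not compute."""

    def __add__(self, other):
        return BinOp(self, other, lambda a, b: a + b)

    def __mul__(self, other):
        return BinOp(self, other, lambda a, b: a * b)


class Vec(Expr):
    """Leaf node holding actual data."""

    def __init__(self, data):
        self.data = list(data)

    def __len__(self):
        return len(self.data)

    def at(self, i):
        return self.data[i]


class BinOp(Expr):
    """Interior node: an element-wise operation on two sub-expressions."""

    def __init__(self, lhs, rhs, op):
        self.lhs, self.rhs, self.op = lhs, rhs, op

    def __len__(self):
        return len(self.lhs)

    def at(self, i):
        return self.op(self.lhs.at(i), self.rhs.at(i))


def evaluate(expr):
    """Evaluate the whole tree in one loop, without intermediate vectors."""
    return Vec(expr.at(i) for i in range(len(expr)))


a, b, c = Vec([1, 2, 3]), Vec([4, 5, 6]), Vec([2, 2, 2])
tree = a * b + c        # nothing is computed yet
result = evaluate(tree)
```

The user-facing code stays in the host language (`a * b + c`), while the library behind `evaluate` is free to choose how the loop is generated. As the text notes, the data model is still that of the host language.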

Development Process and Infrastructure – Knowing that the cost of a bug increases with the lateness of its discovery, it must be emphasized that poor development practices tend to break the structure of the codes and make them more difficult to maintain. This is not new, but implementing massively parallel execution makes code development so complex that it requires a robust and well-organized development methodology. Therefore, the development process must integrate software engineering best practices such as:

1. Control using automatic tools (e.g. continuous integration);

2. Configuration management;

3. Performance measurement integrated inside the code;

4. Unit testing;

5. Non-regression testing;

6. etc.
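As one example from the list above, performance measurement integrated inside the code can be as light as a named timing region. The sketch below uses a Python context manager that accumulates call counts and elapsed time per region; the names `timed`, `timings` and `smooth` are hypothetical, and a real application would likely rely on an established profiling or tracing library.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated measurements per named region (hypothetical in-code instrumentation).
timings = defaultdict(lambda: {"calls": 0, "seconds": 0.0})


@contextmanager
def timed(region):
    """Measure the wall-clock time spent in a named region of the code."""
    start = time.perf_counter()
    try:
        yield
    finally:
        entry = timings[region]
        entry["calls"] += 1
        entry["seconds"] += time.perf_counter() - start


def smooth(u):
    """Stand-in compute kernel: average each interior value with its neighbors."""
    return [u[0]] + [(u[i - 1] + u[i] + u[i + 1]) / 3.0 for i in range(1, len(u) - 1)] + [u[-1]]


field = [float(i % 5) for i in range(1000)]
for _ in range(3):
    with timed("smooth"):
        field = smooth(field)
```

Because the measurements live in the code itself, they can be reported at the end of every run and tracked by the same continuous integration system that runs the unit and non-regression tests.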

Fault Tolerance – Fault tolerance as envisioned for Exascale systems is not presently an important consideration. However, the increase in I/O costs requires strategies to minimize the volume of data needed for implementing a checkpoint restart technique [33, 34, 35]. Application-specific techniques, rather than systems tools, are the only option capable of scaling in the long run.
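An application-specific checkpoint can be kept small by saving only the state needed to restart and recomputing everything else. The Python sketch below is a minimal illustration under simple assumptions: the state is a step counter and a short solution vector, JSON is used as the format, and a failure is simulated with an exception. The function names and the `fail_at` parameter are hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, step, solution):
    """Write only the state needed to restart; anything derived is recomputed later."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "solution": solution}, f)
    os.replace(tmp, path)  # rename into place so a crash does not leave a partial file


def load_checkpoint(path):
    """Return (step, solution), or (0, None) when no checkpoint exists yet."""
    if not os.path.exists(path):
        return 0, None
    with open(path) as f:
        state = json.load(f)
    return state["step"], state["solution"]


def run(path, n_steps, initial, checkpoint_every=5, fail_at=None):
    """Iterate a toy update rule, resuming from the last checkpoint if there is one."""
    step, solution = load_checkpoint(path)
    if solution is None:
        solution = list(initial)
    while step < n_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated failure")
        solution = [0.5 * x + 1.0 for x in solution]
        step += 1
        if step % checkpoint_every == 0:
            save_checkpoint(path, step, solution)
    return solution


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "state.json")
    try:
        run(path, n_steps=20, initial=[0.0, 8.0], fail_at=13)
    except RuntimeError:
        pass  # the job "crashed"; checkpoints for steps 5 and 10 are on disk
    resumed = run(path, n_steps=20, initial=[0.0, 8.0])
    reference = run(os.path.join(workdir, "ref.json"), n_steps=20, initial=[0.0, 8.0])
```

The restarted run resumes from step 10 and reaches the same result as an uninterrupted run, having written two floating-point values per checkpoint rather than a full memory image.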

I/O – Data management and I/O performance are prime considerations in the design of applications. However, they shouldn’t be considered from a technology standpoint only. Basic HPC application design requires the identification of tradeoffs between in-situ vs. ex-situ processing and the selection of data formats, format changes, access policies, data relocation schemes, etc. These tradeoffs are indeed driven by technology and performance but also by the ecosystem exposed to the researchers.

For instance, to tackle I/O performance issues, it is necessary to consider overlapping I/O operations and computations. Some cores can be fully dedicated to I/O to make sure that a truly asynchronous execution is capable of flattening data transfers up to the servers’ maximum capacity. The ability to implement such techniques depends on well-designed data structures, code organization and sizing of the various storage units.
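The overlap of computation and I/O can be sketched with a dedicated writer thread fed through a bounded queue. This Python example is only a small-scale analogy of dedicating cores to I/O: the "write" is an append to a list standing in for a slow storage operation, and the queue size of 4 is an arbitrary assumption.

```python
import queue
import threading


def writer(q, sink):
    """Dedicated I/O worker: drains the queue until it receives the stop marker."""
    while True:
        item = q.get()
        if item is None:
            break
        sink.append(item)  # stand-in for a slow write to storage


def simulate(n_steps, sink):
    """Run a toy computation while a separate thread handles the output."""
    q = queue.Queue(maxsize=4)  # bounded: compute waits if I/O falls too far behind
    io_thread = threading.Thread(target=writer, args=(q, sink))
    io_thread.start()
    state = 0.0
    for step in range(n_steps):
        state = state + 1.5       # stand-in for the compute phase
        q.put((step, state))      # hand the result over and keep computing
    q.put(None)                   # stop marker
    io_thread.join()
    return state


sink = []
final_state = simulate(10, sink)
```

The bounded queue is the sizing decision mentioned in the text: it determines how much output may be buffered before the computation has to wait for the storage side.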

Pre- and Post-processing Integration – Pre- and post-processing may frequently become the application bottlenecks. To prevent this, they must be integrated in the global design process, particularly in the light of the full data cycle. More specifically, in-situ data analysis may allow for a significant decrease in the volume of data to transfer out of the machine [36, 37]. The scientific discovery process must be particularly well understood to design a long-term solution.
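A simple form of in-situ analysis is to reduce the data while it is being produced, so that only summary quantities leave the machine. The Python sketch below keeps running statistics with Welford's online algorithm; the class name and the choice of statistics are assumptions for illustration, and real in-situ pipelines typically extract far richer features.

```python
import math


class RunningStats:
    """Online mean, variance, minimum and maximum (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean
        self.minimum = math.inf
        self.maximum = -math.inf

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)

    def variance(self):
        """Population variance of the values seen so far."""
        return self.m2 / self.n if self.n else 0.0


stats = RunningStats()
for value in [1.0, 2.0, 3.0, 4.0]:  # stand-in for values produced by the simulation
    stats.update(value)

summary = {
    "n": stats.n,
    "mean": stats.mean,
    "variance": stats.variance(),
    "min": stats.minimum,
    "max": stats.maximum,
}
```

Only the few numbers in `summary` need to be written out, whatever the size of the field they were computed from.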

Code Architecture – Code architecture strategies and policies must be defined and enforced, for architectural flaws and the erosion of the architecture over time have expensive consequences. When designing the architecture, the following points are of the essence:

1. Plan structure modularity carefully;

2. Provide for easy modification of the fast-changing parts of the code;

3. Provide for easy mixing of technologies and skills among the application development teams;

4. Use external APIs to improve usability and internal APIs to enforce structured development and best practices;

5. Provide for the development of a machine-specific, highly optimized implementation together with more lasting generic versions. The chosen architecture must allow the extreme optimization of the compute-intensive parts of the code without de-structuring it;

6. Use a plurality of plug-and-play solvers so they can be chosen according to the execution platform. This consideration is not related to code architecture only; it must be consistent with the numerical schemes.
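The last point, plug-and-play solvers behind a stable internal API, can be sketched as a small registry. In this Python illustration two interchangeable solvers for a dense linear system (a direct one and a Jacobi iteration) are registered under names and selected at run time. The registry, the solver names and the fixed iteration count are all hypothetical choices for the example.

```python
SOLVERS = {}


def register(name):
    """Decorator adding a solver to the registry under the given name."""
    def decorator(fn):
        SOLVERS[name] = fn
        return fn
    return decorator


@register("direct")
def solve_direct(a, b):
    """Gaussian elimination with partial pivoting on a small dense system."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


@register("jacobi")
def solve_jacobi(a, b, iterations=200):
    """Jacobi iteration; assumes a diagonally dominant matrix so that it converges."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        x = [(b[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
             for i in range(n)]
    return x


def solve(a, b, solver="direct"):
    """Single entry point: callers never depend on a particular solver."""
    return SOLVERS[solver](a, b)


a = [[4.0, 1.0], [2.0, 5.0]]
b = [6.0, 12.0]
x_direct = solve(a, b, solver="direct")
x_jacobi = solve(a, b, solver="jacobi")
```

The calling code only sees `solve`, so a platform-specific solver can be added to the registry without touching the rest of the application, provided it remains consistent with the numerical scheme.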

Coding Rules – Coding rules, specific to each application and ecosystem, must be seen as a set of guidelines meant to help developers. Their use aims at preserving code efficiency, maintainability and the ability to evolve. This practice is well developed in the Java world, where coders are encouraged to follow best practices in order to ease code maintenance and bug tracking [38]. In the HPC domain, such rules would typically include code structures that favor vectorization, data locality, efficient parallel execution, etc. Code patterns for parallel programming are an efficient method to deal with these issues [39].

Libraries – Libraries provide highly-optimized, frequently used routines [40, 41]. Using external libraries is therefore recommended as long as they are carefully chosen, if only because some are ephemeral and available on a limited number of systems. It must be noted that their use also impacts the application building and distribution processes. The best practices regarding libraries are the following:

1. Choose native or open source libraries [42, 43] as they are usually very well optimized;

2. Avoid old algorithms that have been designed for sequential execution or no memory hierarchy (e.g. 1986 Numerical Recipes [44]);

3. Choose long-lasting or easily replaceable libraries to limit dependence on a given platform.

Runtimes – Runtimes provide intermediate resource management services not directly provided by the operating system (e.g. StarPU [45, 46], X-Kaapi [47], MPC [48]). Using an efficient framework exposing a robust runtime may provide adaptation capabilities to the configuration of the target machine. It is expected that with the growing number of threads, hierarchical techniques will be needed to avoid high thread management overhead. However, this evolution may lead to less accurate scheduling and more workload imbalance.

Debugging – Usually a post-mortem technique [49], debugging is a sensitive issue and is difficult to integrate from the start of a project. However, ease of debugging usually results from the development methodology and the integration of the appropriate observation tools (e.g. tracing capabilities, visualization of code data structures, etc.) within the application. The use of tools such as Valgrind [50] is recommended even though code execution incurs a significant slowdown (x10).

Vector and Data Parallelism – Vector parallelism refers to the use of vector instructions such as AVX; data parallelism refers to models such as the one proposed by GPUs (i.e. SIMT). Vector capabilities account for a large part of the performance of current processors (e.g. 80% on an Intel Xeon Phi). But the implementation of parallelism cannot be left to compilers alone. Code writing rules can greatly help to achieve efficient automatic vectorization by compilers, while directives can complement coding styles. On the other hand, the use of vector intrinsics is not advised as these are not portable and make code difficult to read and maintain.

Technological Watch – Anticipating hardware evolution has a cost, especially when uncertainty generates multiple tracks. While it is recommended to resist the temptation of testing every new thing, technological watch is key to making the right decisions at the right time.

CONCLUSION

Processor evolution toward massive parallelism questions code evolution. Migrating or designing a new code for the decade to come is an extremely challenging task that requires multiple algorithmic and technological choices. This document presented a set of recommendations to help make the right decisions. The listed practices also have the double benefit of mitigating the consequences of these choices and allowing flexibility in the future evolution and adaptation of the application code.

[References]

[26] Hudson – continuous integration.

[27] G. Grospellier and B. Lelandais. The Arcane development framework. In Proceedings of the 8th Workshop on Parallel/High-Performance Object-Oriented Scientific Computing, POOSC ’09, pages 4:1–4:11, New York, NY, USA, 2009. ACM.

[28] M. Strembeck and U. Zdun. An approach for the systematic development of domain-specific languages. Softw. Pract. Exper., 39(15):1253–1292, Oct. 2009.

[29] OpenACC Consortium. The OpenACC application programming interface. 2011.

[30] OpenMP Architecture Review Board. OpenMP application program interface version 3.0, May 2008.

[31] Boost.Proto.

[32] W. Kirschenmann. Vers des noyaux de calcul intensif pérennes. PhD thesis, Université de Lorraine, 2013.

[33] L. Bautista-Gomez, S. Tsuboi, D. Komatitsch, F. Cappello, N. Maruyama, and S. Matsuoka. FTI: High performance fault tolerance interface for hybrid systems. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’11, pages 32:1–32:32, New York, NY, USA, 2011. ACM.

[34] S. B. S. Di, L. B. Gomez, and F. Cappello. Optimization of multi-level checkpoint model for large scale HPC applications. In IEEE IPDPS, 2014.

[35] A. Moody, G. Bronevetsky, K. Mohror, and B. R. d. Supinski. Design, modeling, and evaluation of a scalable multi-level checkpointing system. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’10, pages 1–11, Washington, DC, USA, 2010. IEEE Computer Society.

[36] J. Bennett, H. Abbasi, P.-T. Bremer, R. W. Grout, A. Gyulassy, T. Jin, S. Klasky, H. Kolla, M. Parashar, V. Pascucci, P. P. Pébay, D. C. Thompson, H. Yu, F. Zhang, and J. Chen. Combining in-situ and in-transit processing to enable extreme-scale scientific analysis. In SC 2012, 2012.

[37] A. Mascarenhas, R. W. Grout, P.-T. Bremer, E. R. Hawkes, V. Pascucci, and J. H. Chen. Topological feature extraction for comparison of terascale combustion simulation data. In Topological Methods in Data Analysis and Visualization, Mathematics and Visualization, pages 229–240. Springer Berlin Heidelberg, 2011.

[38] JPL Java Coding Standard.

[39] T. Mattson, B. Sanders, and B. Massingill. Patterns for Parallel Programming. Addison-Wesley Professional, first edition, 2004.

[40] National Science Foundation and Department of Energy. BLAS, 2010.

[41] L. S. Blackford, J. Choi, A. Cleary, E. D’Azevedo, J. Demmel, I. Dhillon, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. ScaLAPACK Users’ Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1997.

[42] Intel Math Kernel Library.

[43] AMD Core Math Library.

[44] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical Recipes – The Art of Scientific Computing. Cambridge: Cambridge University Press, 1986.

[45] StarPU handbook, 2013.

[46] S. Henry. Programming Models and Runtime Systems for Heterogeneous Architectures. PhD thesis, Université de Bordeaux 1, 2013.

[47] T. Gautier, F. Lementec, V. Faucher, and B. Raffin. X-Kaapi: a Multi Paradigm Runtime for Multicore Architectures. In Workshop P2S2 in conjunction of ICPP, page 16, Lyon, France, Oct. 2013.

[48] M. Pérache, H. Jourdren, and R. Namyst. MPC: A unified parallel runtime for clusters of NUMA machines. In E. Luque, T. Margalef, and D. Benitez, editors, Euro-Par, volume 5168 of Lecture Notes in Computer Science, pages 78–88. Springer, 2008.

[49] K. Pouget. Programming-Model Centric Debugging for Multicore Embedded Systems. PhD thesis, Université de Grenoble, 2014.

[50] Valgrind.