When it comes to performance, your code matters!
Mike Pearce, Intel HPC Developer Evangelist
Modern high performance computers are built with a combination of resources including multi-core processors, many-core processors, large caches, high speed memory, high bandwidth inter-processor communications fabric, and high speed I/O capabilities. High performance software needs to be designed to take full advantage of this wealth of resources. Whether re-architecting or tuning existing applications for maximum performance, or architecting new applications for existing or future machines, it is critical to be aware of the interplay between programming models and the efficient use of these resources. Consider this a starting point for information regarding Code Modernization. When it comes to performance, your code matters!
Building parallel versions of software can enable applications to run a given data set in less time, run multiple data sets in a fixed amount of time, or run large-scale data sets that are prohibitive with unoptimized software. The success of parallelization is typically quantified by measuring the speedup of the parallel version relative to the serial version. It is also useful, however, to compare that speedup with the upper limit of the potential speedup, which can be estimated using Amdahl’s Law and Gustafson’s Law.
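As a rough illustration of those upper limits, the sketch below evaluates both laws in C. It assumes the usual textbook forms: Amdahl's Law as S = 1 / ((1 - p) + p / n) for a fixed problem size, and Gustafson's Law as S = (1 - p) + p * n for a problem that grows with the machine. Here p is the parallelizable fraction of the run time (measured on the serial run for Amdahl, on the parallel run for Gustafson) and n is the number of processors. The function names are illustrative only and do not come from any library.

```c
#include <assert.h>
#include <stdio.h>

/* Amdahl's Law: speedup bound for a FIXED problem size.
 * p = parallelizable fraction of run time (0..1), n = processor count. */
double amdahl_speedup(double p, int n)
{
    assert(p >= 0.0 && p <= 1.0 && n > 0);
    return 1.0 / ((1.0 - p) + p / (double)n);
}

/* Gustafson's Law: scaled speedup when the problem size grows with n. */
double gustafson_speedup(double p, int n)
{
    assert(p >= 0.0 && p <= 1.0 && n > 0);
    return (1.0 - p) + p * (double)n;
}

/* Hypothetical helper: print both bounds for 1, 2, 4, ... 64 processors. */
void print_speedup_table(double p)
{
    for (int n = 1; n <= 64; n *= 2)
        printf("n=%2d  Amdahl=%6.2f  Gustafson=%6.2f\n",
               n, amdahl_speedup(p, n), gustafson_speedup(p, n));
}
```

Calling print_speedup_table(0.9) shows the Amdahl bound flattening out below 10 while the Gustafson bound keeps growing with n, which is why a measured speedup is worth comparing against both.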
Good code design takes into consideration several levels of parallelism.
The first level of parallelism is Vector parallelism (within a core) where identical computational instructions are performed on large chunks of data. Both scalar and parallel portions of code will benefit from the efficient use of vector computing.
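A minimal sketch of a loop that a vectorizing compiler can typically map onto SIMD instructions is shown below. The function name saxpy and its signature are assumptions chosen for illustration; whether the loop is actually vectorized depends on the compiler, the optimization flags, and the target instruction set.

```c
#include <assert.h>
#include <stddef.h>

/* Computes y[i] = a * x[i] + y[i] for every element.
 * The same multiply-add is applied to each element, with no dependence
 * between iterations, which is the pattern vector units are built for.
 * 'restrict' promises the compiler that x and y do not overlap, removing
 * one common obstacle to automatic vectorization. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

The caller is responsible for honoring the restrict promise: passing overlapping arrays would make the behavior undefined.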
A second level of parallelism, called Thread parallelism, is characterized by a number of threads of a single process that communicate via shared memory and cooperate on a given task.
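One way to sketch thread parallelism is with an OpenMP work-sharing loop, shown below for a dot product. The choice of OpenMP and the function name dot are assumptions made for illustration. The code needs the compiler's OpenMP option to run in parallel (for example -fopenmp with GCC); without it the pragma is ignored and the loop simply runs serially with the same result.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product of a and b, with loop iterations divided among the threads
 * of one process. All threads read a and b through shared memory. */
double dot(size_t n, const double *a, const double *b)
{
    assert(a != NULL && b != NULL);
    double sum = 0.0;

    /* reduction(+:sum) gives each thread a private partial sum and adds
     * the partial sums together when the loop ends, so no thread writes
     * to the shared variable while others are using it. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)n; ++i)
        sum += a[i] * b[i];

    return sum;
}
```

Because floating-point addition is not associative, the parallel result can differ from the serial one in the last bits when the inputs are not exactly representable; that is normal for reductions.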
The third level is found in the many codes that have been developed as independent cooperating processes, communicating with each other via some message passing system. This is called distributed memory Rank parallelism, so named because each process is given a unique rank number.
Developing code that uses all three levels of parallelism effectively and efficiently is the goal of code modernization.
Factoring into these considerations is the impact of the memory model of the machine: amount and speed of main memory, memory access times with respect to location of memory, cache sizes and numbers, and requirements for memory coherence.
Poor data alignment for vector parallelism can carry a large performance penalty. Data should be organized in a cache-friendly way; if it is not, performance will suffer whenever the application requests data that is not in the cache. The fastest memory access occurs when the needed data is already in cache. Data is transferred to and from cache in cache lines, so if the next piece of data is not within the current cache line, or is scattered among multiple cache lines, the application may have poor cache efficiency.
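The cache-line effect can be sketched with two traversals of the same matrix, assumed here to be stored in row-major order in one flat array. Both functions (whose names are hypothetical) compute the same sum; they differ only in the order in which memory is touched. How large the difference is in practice depends on the matrix size, the cache hierarchy, and whether the compiler reorders the loops on its own.

```c
#include <assert.h>
#include <stddef.h>

/* Row-by-row walk: consecutive iterations touch adjacent addresses,
 * so every element of a fetched cache line is used before moving on. */
double sum_row_order(const double *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
            sum += m[i * cols + j];      /* stride: 1 element */
    return sum;
}

/* Column-by-column walk over the same row-major data: consecutive
 * iterations are 'cols' elements apart, so for a wide matrix each
 * access may land in a different cache line. */
double sum_column_order(const double *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    double sum = 0.0;
    for (size_t j = 0; j < cols; ++j)
        for (size_t i = 0; i < rows; ++i)
            sum += m[i * cols + j];      /* stride: cols elements */
    return sum;
}
```

Timing both functions on a matrix much larger than the last-level cache is a simple way to see how much the access pattern alone can matter on a given machine.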
Division and transcendental math functions are expensive even when directly supported by the instruction set. If your application performs many division and square root operations at run time, performance may degrade because the hardware has a limited number of functional units for them, and the pipeline to these units may become saturated. Since these instructions are expensive, the developer may wish to cache frequently used values to improve performance.
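One simple form of caching a frequently used value is to compute a reciprocal once and reuse it, as in the sketch below. The function names are hypothetical. Note that multiplying by a precomputed reciprocal is not always bit-identical to dividing, which is one reason compilers usually do not make this substitution by themselves under strict floating-point settings; the developer has to decide whether the small rounding difference is acceptable.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: one division per element. */
void scale_by_division(size_t n, double *v, double d)
{
    assert(v != NULL && d != 0.0);
    for (size_t i = 0; i < n; ++i)
        v[i] = v[i] / d;
}

/* Cached value: a single division before the loop, then only
 * multiplications inside it. */
void scale_by_reciprocal(size_t n, double *v, double d)
{
    assert(v != NULL && d != 0.0);
    const double inv = 1.0 / d;          /* computed once, reused n times */
    for (size_t i = 0; i < n; ++i)
        v[i] = v[i] * inv;
}
```

The same idea extends to square roots and transcendental functions: if the argument does not change inside the loop, evaluate it once outside.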
There is no “one recipe, one solution” technique. A great deal depends on the problem being solved and the long-term requirements for the code, but a good developer will pay attention to all levels of optimization, both for today’s requirements and for the future.
Intel has built a full suite of tools to aid in code modernization – compilers, libraries, debuggers, performance analyzers, parallel optimization tools and more. Intel even has webinars, documentation, training examples, and best known methods and case studies which are all based on over thirty years of experience as a leader in the development of parallel computers.
Code Modernization 5 Stage Framework for Multi-level Parallelism
The Code Modernization optimization framework takes a systematic approach to application performance improvement. This framework takes an application through five optimization stages, each stage iteratively improving the application performance. But before you start the optimization process, you should consider whether the application needs to be re-architected (given the guidelines below) to achieve the highest performance, and then follow the Code Modernization optimization framework.
By following this framework, an application can achieve the highest performance possible on Intel® Architecture. The stepwise approach helps the developer achieve the best application performance in the shortest possible time. In other words, it allows the program to maximize its use of all parallel hardware resources in the execution environment. The stages:
Leverage optimization tools and libraries: Profile the workload using Intel VTune Amplifier to identify hotspots and Intel Advisor XE to identify vectorization and threading opportunities. Use Intel compilers to generate optimal code and apply optimized libraries such as Intel Math Kernel Library, Intel TBB, and OpenMP when appropriate.
Scalar, serial optimization: Maintain the proper precision, type your constants to match that precision, and use appropriate functions and precision flags.
Vectorization: Utilize SIMD features in conjunction with data layout optimizations. Apply cache-aligned data structures, convert from arrays of structures to structures of arrays, and minimize conditional logic.
Thread Parallelization: Profile thread scaling and affinitize threads to cores. Scaling issues typically are a result of thread synchronization or inefficient memory utilization.
Scale your application from multicore to many-core (distributed memory Rank parallelism): Scaling is especially important for highly parallel applications. Minimize the changes and maximize the performance as the execution target changes from one flavor of the Intel architecture (Intel® Xeon® processor) to another (Intel Xeon Phi Coprocessor).
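The data layout conversion named in the Vectorization stage, from an array of structures (AoS) to a structure of arrays (SoA), can be sketched as follows. The particle types, field names, and the fixed capacity N_PARTICLES are all hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define N_PARTICLES 1024   /* illustrative fixed capacity */

/* Array of structures: the fields of one particle sit together, so the
 * x values of neighboring particles are sizeof(struct particle_aos) apart. */
struct particle_aos {
    float x, y, z, mass;
};

/* Structure of arrays: all x values are contiguous, then all y values,
 * and so on. A loop over one field reads memory with unit stride. */
struct particles_soa {
    float x[N_PARTICLES];
    float y[N_PARTICLES];
    float z[N_PARTICLES];
    float mass[N_PARTICLES];
};

/* AoS version: only one of every four floats loaded is actually used. */
void shift_x_aos(struct particle_aos *p, size_t n, float dx)
{
    assert(p != NULL);
    for (size_t i = 0; i < n; ++i)
        p[i].x += dx;
}

/* SoA version: every float loaded is used, and a full vector register
 * can be filled from consecutive addresses. */
void shift_x_soa(struct particles_soa *p, size_t n, float dx)
{
    assert(p != NULL && n <= N_PARTICLES);
    for (size_t i = 0; i < n; ++i)
        p->x[i] += dx;
}
```

Both functions produce the same values; the SoA form is usually the easier one for a compiler to vectorize and makes fuller use of each cache line when a loop touches only some of the fields.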
© HPC Today 2019 - All rights reserved.