What is Code Modernization?
September 21, 2015

Code Modernization – The 5 Stages in Practice

Stage 1
At the beginning of your optimization project, select an optimizing development environment. The decision you make at this step will have a profound influence on the later steps. Not only will it affect the results you get, it could substantially reduce the amount of work you have to do. The right optimizing development environment can provide you with good compiler tools; optimized, ready-to-use libraries; and debugging and profiling tools to pinpoint exactly what the code is doing at runtime. Check out the webinars on the Intel® Advisor XE tool, which can be used to identify vectorization and threading opportunities.

Stage 2
Once you have exhausted the readily available optimization solutions, you will need to work on the application source code to extract greater performance. Before you begin active parallel programming, make sure your application delivers the correct results before you vectorize and parallelize it. Equally important, make sure it does the minimum number of operations to get that correct result. You should look at data- and algorithm-related issues such as the following (a short sketch follows the list):

  • Choosing the right floating point precision
  • Choosing the right approximation method and accuracy, e.g. polynomial vs. rational approximations
  • Avoiding jump algorithms
  • Reducing loop operation strength by using incremental (iteration-based) calculations
  • Avoiding or minimizing conditional branches in your algorithms
  • Avoiding repetitive calculations by reusing previously calculated results
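As a hedged illustration of two of the points above (the function and variable names are hypothetical, not from the article), the sketch below hoists a repeated calculation out of a loop and applies strength reduction by replacing a per-iteration multiply with an incremental index update:

    // Illustrative only: reuse of a previously computed, loop-invariant value.
    #include <cmath>
    #include <cstddef>

    void apply_gain(float* samples, std::size_t n, float gain_db)
    {
        // Compute the expensive conversion once, outside the loop, instead of
        // repeating it on every iteration.
        const float gain = std::pow(10.0f, gain_db / 20.0f);
        for (std::size_t i = 0; i < n; ++i) {
            samples[i] *= gain;   // branch-free body, no repeated pow() call
        }
    }

    // Strength reduction: replace the per-iteration multiply i * stride with
    // an incrementally updated index.
    void fill_strided(float* out, std::size_t n, std::size_t stride, float value)
    {
        std::size_t idx = 0;
        for (std::size_t i = 0; i < n; ++i) {
            out[idx] = value;   // idx == i * stride, maintained by addition only
            idx += stride;
        }
    }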

You may also have to deal with language-related performance issues. If you have chosen C/C++, the language-related issues include the following (a short sketch follows the list):

  • Use explicit typing for all constants to avoid auto-promotion
  • Choose the right C runtime functions for the data type, e.g. doubles vs. floats: exp() vs. expf(), abs() vs. fabs()
  • Explicitly tell the compiler about pointer aliasing
  • Explicitly inline function calls to avoid overhead
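A minimal sketch of these language-level points, assuming a compiler that accepts the common __restrict extension (the identifiers are illustrative):

    #include <math.h>

    // Explicitly inlined helper to avoid call overhead (the compiler may
    // inline it anyway at higher optimization levels).
    static inline float decay(float x)
    {
        // 0.5f, not 0.5: an untyped double constant would promote the whole
        // expression to double precision.
        return 0.5f * expf(-x);   // expf(), not exp(): stay in single precision
    }

    // __restrict tells the compiler that out and in do not alias, which
    // removes a common obstacle to vectorization.
    void apply_decay(float* __restrict out, const float* __restrict in, int n)
    {
        for (int i = 0; i < n; ++i) {
            out[i] = decay(in[i]);
        }
    }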

Stage 3
Try vector-level parallelism. First try to vectorize the innermost loop. For efficient vector loops, make sure that there is minimal control flow divergence and that memory accesses are coherent. Outer loop vectorization is a technique to enhance performance. By default, compilers attempt to vectorize innermost loops in nested loop structures. But in some cases the number of iterations in the innermost loop is small, and inner-loop vectorization is not profitable. However, if an outer loop contains more work, a combination of elemental functions, strip-mining, and pragma/directive SIMD can force vectorization at this outer, profitable level.
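One way to apply this, sketched below under the assumption of a compiler with OpenMP SIMD support (e.g. built with -fopenmp-simd; the function and data layout are hypothetical), is to place the SIMD directive on the outer loop when the inner trip count is tiny:

    #include <cstddef>

    void dot3(float* __restrict out, const float* __restrict a,
              const float* __restrict b, std::size_t n)
    {
        #pragma omp simd   // vectorize the outer loop: it carries the real work
        for (std::size_t i = 0; i < n; ++i) {
            float acc = 0.0f;
            for (int c = 0; c < 3; ++c) {   // short inner loop: only 3 iterations
                acc += a[i * 3 + c] * b[i * 3 + c];
            }
            out[i] = acc;
        }
    }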

SIMD performs best on “packed” and aligned input data, and by its nature penalizes control divergences. In addition, good SIMD and thread performance on modern hardware can be obtained if the application implementation puts a focus on data proximity.
If the innermost loop does not have enough work (e.g., the trip count is so low that vectorization yields no measurable benefit), or there are data dependencies that prevent vectorizing the innermost loop, try vectorizing the outer loop. The outer loop is likely to have control flow divergence, especially if the trip count of the inner loop differs for each iteration of the outer loop. This will limit the gains from vectorization. The memory accesses of the outer loop are also more likely to be divergent than those of an inner loop, which results in gather/scatter instructions instead of vector loads and stores and significantly limits the scaling gained from vectorization. Data transformations, such as transposing a two-dimensional array or switching from Arrays of Structures to Structures of Arrays, may alleviate these problems.
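The sketch below illustrates the Structure-of-Arrays side of that transformation (the types and field names are hypothetical): the per-field arrays give the vectorized loop unit-stride accesses instead of gathers and scatters:

    #include <cstddef>
    #include <vector>

    // AoS: each particle's fields are interleaved in memory, so a loop that
    // only touches x produces strided, gather-like accesses.
    struct ParticleAoS { float x, y, z, w; };

    // SoA: each field is contiguous, so a vectorized loop over x uses
    // unit-stride vector loads and stores.
    struct ParticlesSoA {
        std::vector<float> x, y, z, w;
    };

    void shift_x(ParticlesSoA& p, float dx)
    {
        float* __restrict xs = p.x.data();
        const std::size_t n = p.x.size();
        #pragma omp simd
        for (std::size_t i = 0; i < n; ++i) {
            xs[i] += dx;   // contiguous access, easily vectorized
        }
    }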

When the loop hierarchy is shallow, the above guideline may result in a loop that needs to be both parallelized and vectorized. In that case, that loop has to both provide enough parallel work to compensate for the overhead and also maintain control flow uniformity and memory access coherence.
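Where a single loop must carry both levels, OpenMP's combined construct expresses that directly; the example below is a generic sketch (the saxpy kernel is illustrative, not from the article):

    #include <cstddef>

    void saxpy(float* __restrict y, const float* __restrict x,
               float a, std::size_t n)
    {
        // Threads split the iteration space; each thread's chunk is vectorized.
        #pragma omp parallel for simd
        for (std::size_t i = 0; i < n; ++i) {
            y[i] = a * x[i] + y[i];
        }
    }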

Stage 4
Now we get to thread-level parallelization. Identify the outermost level and try to parallelize it. Obviously, this requires taking care of potential data races and moving data declarations inside the loop as necessary. It may also require that the data be maintained in a cache-efficient manner, to reduce the overhead of maintaining the data across multiple parallel paths. The rationale for choosing the outermost level is to provide as much work as possible to each individual thread. Amdahl's law states that the speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program. Since the amount of work needs to compensate for the overhead of parallelization, it helps to have as large a parallel effort in each thread as possible. If the outermost level cannot be parallelized due to unavoidable data dependencies, try to parallelize at the next-outermost level that can be parallelized correctly.
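A minimal OpenMP sketch of this stage (the kernel is hypothetical): the outermost loop is parallelized, per-iteration temporaries are declared inside the loop so each thread has its own copy, and the shared accumulator is combined through a reduction to avoid a data race:

    #include <cstddef>

    double total_energy(const double* m, const double* v, std::size_t n)
    {
        double sum = 0.0;
        #pragma omp parallel for reduction(+ : sum)
        for (std::size_t i = 0; i < n; ++i) {
            // Declared inside the loop, so each thread gets its own copy
            // and there is no race on e.
            const double e = 0.5 * m[i] * v[i] * v[i];
            sum += e;
        }
        return sum;
    }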

If the amount of parallel work achieved at the outermost level appears sufficient for the target hardware and likely to scale with a reasonable increase of parallel resources, you are done. Do not add more parallelism: the gains are unlikely, and the extra thread-control overhead will negate any performance improvement.

If the amount of parallel work is insufficient, e.g. as measured by core scaling that only scales up to a small core count rather than the actual core count, attempt to parallelize an additional layer, as far out as possible. Note that you don't necessarily need to scale the loop hierarchy to all the available cores, as there may be additional loop hierarchies executing in parallel.
If the previous step did not result in scalable code, there may not be enough parallel work in your algorithm. This may mean that partitioning a fixed amount of work among many threads gives each thread too little work, so the overhead of starting and terminating threads swamps the useful work. Perhaps the algorithm can be scaled to do more work, for example by trying a bigger problem size.
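One common way to expose more parallel work at an outer level, shown here as a sketch rather than a prescription (the grid kernel is hypothetical), is to collapse two loop levels into a single parallel iteration space:

    void scale_grid(float* grid, int rows, int cols, float s)
    {
        #pragma omp parallel for collapse(2)
        for (int i = 0; i < rows; ++i) {
            for (int j = 0; j < cols; ++j) {
                grid[i * cols + j] *= s;   // rows*cols iterations shared among threads
            }
        }
    }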

Make sure your parallel algorithm is cache efficient. If it is not, rework it to be cache efficient, as cache inefficient algorithms do not scale with parallelism.
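As a hedged example of such a rework (the transpose kernel and the tile size are illustrative), cache blocking keeps each thread working on a tile that fits in cache:

    #include <algorithm>

    void transpose_blocked(float* __restrict dst, const float* __restrict src, int n)
    {
        const int B = 64;   // tile edge; tune so a block stays resident in cache
        #pragma omp parallel for collapse(2)
        for (int ii = 0; ii < n; ii += B) {
            for (int jj = 0; jj < n; jj += B) {
                const int imax = std::min(ii + B, n);
                const int jmax = std::min(jj + B, n);
                for (int i = ii; i < imax; ++i) {
                    for (int j = jj; j < jmax; ++j) {
                        dst[j * n + i] = src[i * n + j];
                    }
                }
            }
        }
    }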

Check out the Intel Guide for Developing Multithreaded Applications series for more details.

Stage 5
Lastly we get to multi-node (rank) parallelism. To many developers, the Message Passing Interface (MPI) is a black box that "just works" behind the scenes to transfer data from one MPI task (process) to another. The beauty of MPI for the developer is that the algorithmic coding is hardware independent. The concern developers have is that with many-core architectures of 60+ cores, the communication between tasks may create a communication storm, either within a node or across nodes. To mitigate these communication bottlenecks, applications should employ hybrid techniques, using a few MPI tasks and many OpenMP threads.
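A skeleton of that hybrid structure, assuming MPI and OpenMP are available (error handling omitted; this is a sketch, not a full application):

    #include <mpi.h>
    #include <omp.h>
    #include <cstdio>

    int main(int argc, char** argv)
    {
        int provided = 0;
        // Request FUNNELED: only the master thread of each rank makes MPI
        // calls, keeping intra-node work on OpenMP threads rather than on
        // many MPI tasks.
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        int rank = 0, size = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        #pragma omp parallel
        {
            // Node-local work happens here on many threads per rank.
            #pragma omp master
            std::printf("rank %d of %d running %d threads\n",
                        rank, size, omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }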

A well-optimized application should address vector parallelization, multi-threading parallelization, and multi-node (rank) parallelization. However, to do this efficiently it is helpful to use a standard step-by-step methodology to ensure each stage is considered. The stages described here can be (and often are) reordered depending upon the specific needs of each individual application, and you can iterate within a stage more than once to achieve the desired performance.

Experience has shown that all stages must at least be considered to ensure an application delivers great performance on today’s scalable hardware as well as being well positioned to scale effectively on upcoming generations of hardware.
