Optimizing MPI implementations (Part I)
By Jérôme Vienne  |  March 17, 2014

Techniques for optimizing the MPI layer underlying an application are often unknown to scientists, yet they can be decisive for performance. This month, two specific examples show that a small, concerted effort can change results dramatically…

Jérôme Vienne, PhD
HPC Software Tools Group
Texas Advanced Computing Center (TACC)

One of my duties here at the Texas Advanced Computing Center (TACC) in Austin is to help scientists using our clusters get the best performance out of their applications. Several days each quarter, I respond to requests and tickets showing that some researchers regard MPI implementations as fixed and unchanging – “black boxes” to be used as delivered, with no possibility of alteration or optimization.

A direct consequence of this fundamental misconception is a degree of frustration with the overall application performance obtained. More specifically, these messages reveal an unfamiliarity with the mechanisms at work inside the implementations concerned. In this article, aimed more at scientists than at seasoned MPI developers, I will try to explain the importance of MPI process mapping and the possibilities it opens up, and then tackle the thorny question of choosing the most suitable benchmark to achieve tangible code improvements. Note that this first part will be followed by a second one covering good practices when scaling up and the many avenues of performance enhancement offered by InfiniBand interconnects.

Every MPI implementation is different

Although all MPI implementations follow the same standard, each has to be considered individually. Why? Because nothing in the standard defines, for example, the algorithms to be used for collective communications or the placement of MPI processes within a node. These decisions are left to the discretion of the developers, and it is precisely these choices that make the difference. Given the wide variety of existing architectures and application behaviors out there, it is of course illusory to hope for the best possible tuning in every case. Some proprietary implementations do benefit from tightly integrated, vendor-specific physical layers (Cray, IBM, etc.), but even there, an MPI implementation may require a different tuning than the one recommended by default.

In practice, it is often possible to optimize some aspects of an implementation’s performance with environment variables or global options. These small changes can have a major impact on the efficiency of your applications. To make them appropriately, however, you need to know the range of possibilities offered by the implementation you have chosen to work with, and that implies at least a minimal reading of the technical documentation. However tedious that may be, the effort usually pays off. Let’s take two specific – and convincing – examples.
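Before diving into them, here is a flavor of what such tuning knobs look like in a job script. The variable names below come from the MVAPICH2 and Intel MPI documentation of that period; exact names, accepted values and the launcher command vary between versions and systems, so treat this as a sketch rather than a recipe.

# MVAPICH2: raise the message size (in bytes, value chosen arbitrarily here) at which
# the library switches from the eager to the rendezvous protocol.
export MV2_IBA_EAGER_THRESHOLD=65536

# Intel MPI: force a specific MPI_Allreduce algorithm instead of the heuristic default
# (the numeric codes are listed in the Intel MPI reference documentation).
export I_MPI_ADJUST_ALLREDUCE=5

mpirun -np 16 ./my_application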

Mapping MPI processes

It is easy to imagine that communication between two nodes is more demanding than communication within a single node. It is also true, however, that performance levels vary according to whether communication takes place within the same socket or between two sockets.

To illustrate the point, we measured latency by running the latency test from the OSU Micro-benchmarks included with MVAPICH2 1.9 on the Stampede machine. Stampede cluster nodes have two sockets of 8 cores each, with cores numbered 0 to 7 on the first socket and 8 to 15 on the second. Figure 1 clearly shows that communication latency within a single socket is significantly lower than between two sockets. The reason is simple: when communication takes place within a single socket, the data stay in the same L3 cache, which logically speeds up the exchange.
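Such a measurement is easy to reproduce by pinning the two ranks explicitly. The sketch below uses MVAPICH2’s MV2_CPU_MAPPING variable and the osu_latency binary; the exact launcher (mpirun, mpirun_rsh, ibrun, …) and the path to the benchmark depend on your installation.

# Both ranks on socket 0 (cores 0 and 7): intra-socket latency.
export MV2_CPU_MAPPING=0:7
mpirun -np 2 ./osu_latency

# Rank 1 moved to socket 1 (core 8): inter-socket latency.
export MV2_CPU_MAPPING=0:8
mpirun -np 2 ./osu_latency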

The default mapping of MPI processes depends entirely on the decisions made by the developers of the implementation. Arranging MPI processes carefully is therefore recommended to make the best use of the cache hierarchy. To achieve this, two questions must be addressed:

– What is the default arrangement used by my implementation?

– How can I alter it?

Many MPI implementations document how they place MPI processes. If yours does not, use the code in Listing 1 to obtain the process-to-core correspondence (the “cpuset”). To highlight how mappings differ between implementations, we ran this program with four MPI processes on Stampede under both Intel MPI and MVAPICH2.
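Listing 1 itself is not reproduced on this page; a minimal sketch of such a program, assuming a Linux cluster where sched_getaffinity() is available, could look like the following (compile with mpicc). Each rank prints the list of cores it is allowed to run on.

#define _GNU_SOURCE
#include <mpi.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, c;
    char host[64], cpus[1024] = "";
    cpu_set_t set;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    gethostname(host, sizeof(host));

    /* Ask the kernel which cores this process is allowed to run on. */
    sched_getaffinity(0, sizeof(set), &set);
    for (c = 0; c < CPU_SETSIZE; c++)
        if (CPU_ISSET(c, &set))
            snprintf(cpus + strlen(cpus), sizeof(cpus) - strlen(cpus), "%d ", c);

    printf("rank %d on %s: cpuset = %s\n", rank, host, cpus);

    MPI_Finalize();
    return 0;
}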

Listing 2 shows the result obtained with Intel MPI. Without an explanation, this output may appear surprising: the 16 lines show each MPI rank repeated four times with a different cpuset. The reason is that each MPI process is allowed to migrate within a set of cores: rank 0 can go to cores 0 to 3, rank 1 to cores 4 to 7, and so on. For hybrid programming (MPI + OpenMP) this is a sensible choice. With four MPI processes each running four OpenMP threads, for instance, you can be sure that the threads of a given rank will use the same cache. On the other hand, in a pure MPI code, any migration of MPI processes may be slightly disruptive.
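For the hybrid case, this placement can also be requested explicitly rather than relied on as a default. A minimal sketch, assuming Intel MPI and its I_MPI_PIN_DOMAIN variable (the accepted values depend on the Intel MPI version, and the application name is of course a placeholder):

export OMP_NUM_THREADS=4
# One pinning domain per MPI rank, sized to the OpenMP thread count,
# so the four threads of a rank stay on the same four cores.
export I_MPI_PIN_DOMAIN=omp
mpirun -np 4 ./hybrid_application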

Listing 3 shows a very different result with MVAPICH2. Here, matters are clearer: each MPI process is bound to a single core. The advantage is that exchanges are faster for small messages; for large messages, however, cache misses may occur. And in our hybrid-programming example (four MPI processes with four OpenMP threads each), the results are terrible, since the four threads of each rank end up sharing a single core.
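With MVAPICH2, the hybrid situation can be repaired by giving each rank a block of cores instead of a single one. A sketch, assuming the MV2_CPU_MAPPING syntax described in the MVAPICH2 user guide (which may differ between versions); alternatively, MV2_ENABLE_AFFINITY=0 disables the library’s own pinning entirely and leaves thread placement to the OpenMP runtime:

export OMP_NUM_THREADS=4
# One block of four cores per rank on a 16-core Stampede node.
export MV2_CPU_MAPPING=0-3:4-7:8-11:12-15
mpirun -np 4 ./hybrid_application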

This shows just how important it is to use the options available in each implementation when trying to optimize an application’s baseline performance. These options make it quite easy to graft Intel MPI’s task mapping onto MVAPICH2 – or vice versa, as sketched below. And what can be achieved with these two implementations can also be done with most others…
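For instance, to obtain MVAPICH2-style one-core-per-rank binding under Intel MPI, the Intel MPI documentation of that period exposes I_MPI_PIN_DOMAIN (again a sketch; exact values depend on the version, and the binary name is a placeholder):

# Intel MPI: one single-core pinning domain per rank, i.e. MVAPICH2-like behavior.
export I_MPI_PIN_DOMAIN=core
mpirun -np 16 ./pure_mpi_application

The same placement can also be expressed with I_MPI_PIN_PROCESSOR_LIST, and MVAPICH2 offers comparable controls such as MV2_CPU_BINDING_POLICY; the principle is always the same: know the defaults, then override them deliberately.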
