Optimizing MPI implementations (Part II)
By Jérôme Vienne  |  March 17, 2014

Often overlooked by scientists, techniques for tuning the underlying MPI layer can be decisive for application performance. This time we examine strategies for interacting with the InfiniBand transports, highlight a bug in MVAPICH2 and make a call for more evidence…

Jérôme Vienne, PhD
HPC Software Tools Group
Texas Advanced Computing Center (TACC)

While MPI libraries (MPICH, Open MPI, Intel MPI, MVAPICH2, etc.) all comply with a common standard, each also has its own set of specific behaviors that need to be well understood if the aim is to fine-tune the applications that use them. In our previous article, we looked at the placement of MPI processes and the choice of algorithms used for collective communications. On that occasion we saw that the technical choices made by library developers sometimes differ widely, and that ignoring them can deprive us of significant opportunities for acceleration.

This month we focus on the possibilities offered by MPI implementations when a large-scale application runs over InfiniBand. As everyone knows, the number of cores in TOP500 clusters is constantly growing. For scientists this is a real advantage: these enormous resources allow them to tackle increasingly complex problems. But using more cores also increases the memory consumed by each MPI process…

In light of this, reducing the memory occupied by each process becomes critical. My experience at TACC has taught me that many users of computing clusters are unaware that the InfiniBand standard and its Mellanox implementation offer several ways of reducing this memory consumption. We will therefore take a closer look at two of them: the eXtended Reliable Connection (XRC) and the Unreliable Datagram (UD).

At the heart of InfiniBand

Year after year, InfiniBand has held pride of place in the HPC community. In June 2003, only one TOP500 system was using it; as the technology matured, its adoption expanded rapidly, and it now equips 205 systems, i.e. 41% of the list. This makes it without doubt the leading high-performance network architecture available today.

In practical terms, the speed of InfiniBand links shortens communication phases and, consequently, reduces the time cores spend waiting. An InfiniBand network is distinguished by two essential characteristics: low latency and high bandwidth, synonymous with elevated transfer rates. In measurement terms, latency is the time needed to send and receive the first byte, whereas bandwidth is the maximum data transfer rate achieved when sending large messages.
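To make these two figures of merit concrete, here is a minimal MPI ping-pong sketch in C: one-way latency is estimated with a 1-byte message and bandwidth with a 1 MB message. The message sizes and iteration count are arbitrary illustrative choices; established suites such as the OSU micro-benchmarks measure the same quantities far more carefully.

/* Minimal ping-pong sketch: half the round-trip time approximates one-way
 * latency; dividing message size by that time approximates bandwidth.
 * Run with two ranks, e.g. "mpirun -np 2 ./pingpong". */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    if (nprocs < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 ranks\n");
        MPI_Finalize();
        return 1;
    }

    const int iters = 1000;
    const int sizes[2] = { 1, 1 << 20 };   /* 1 B and 1 MB */
    char *buf = malloc(sizes[1]);

    for (int s = 0; s < 2; s++) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, sizes[s], MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, sizes[s], MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, sizes[s], MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, sizes[s], MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        /* Half the measured round-trip time gives the one-way time. */
        double t = (MPI_Wtime() - t0) / (2.0 * iters);
        if (rank == 0)
            printf("size %d B: one-way time %.2f us, bandwidth %.2f MB/s\n",
                   sizes[s], t * 1e6, sizes[s] / t / 1e6);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}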

From a technical perspective, InfiniBand's architecture offers four transport modes: Reliable Connection (RC), Reliable Datagram (RD), Unreliable Datagram (UD) and Unreliable Connection (UC). Since RC and UD are the best suited to MPI applications, they are the focus of our attention here. We intentionally ignore RD, which, unlike the other three, is not implemented by any hardware and cannot be used with MPI. As for UC, it does operate in connected mode, but it is not reliable and the order in which packets arrive is not guaranteed. Moreover, although it supports RDMA accesses, the amount of memory needed to use it is identical to RC's, which ultimately makes it of little interest for MPI libraries.
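For readers curious about where these transport modes surface in software, the sketch below shows, at the verbs level, how a queue pair is created with a given transport type; MPI libraries do this on our behalf. It is a minimal sketch with abbreviated error handling, and picking the first device in the list is an assumption made purely for illustration.

/* Creating a queue pair with a chosen InfiniBand transport type using
 * libibverbs. Compile with -libverbs. The qp_type field selects the
 * transport service: IBV_QPT_RC, IBV_QPT_UC or IBV_QPT_UD. */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no IB device found\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);   /* first device, by assumption */
    struct ibv_pd  *pd = ibv_alloc_pd(ctx);
    struct ibv_cq  *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);

    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .cap     = { .max_send_wr = 16, .max_recv_wr = 16,
                     .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,   /* RC here; IBV_QPT_UD for datagram transport */
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);
    if (!qp) { perror("ibv_create_qp"); return 1; }
    printf("created an RC queue pair, qp_num = %u\n", qp->qp_num);

    ibv_destroy_qp(qp);
    ibv_destroy_cq(cq);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}

Each RC connection requires one such queue pair (plus its buffers) per peer, which is precisely where the memory pressure discussed next comes from.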

From a usage standpoint, RC takes the lion's share. For developers, choosing RC is justified by its reliability, the fact that it operates in connected mode, and because it is the transport mode offering the largest number of capabilities in InfiniBand (RDMA, atomic operations, etc.). That said, as the number of MPI processes increases, RC is also the most costly transport mode in terms of resources. Why? The explanation is simple: in connected mode, each pair of communicating processes must establish a dedicated connection requiring several KB of memory. That is not much, but when large-scale experiments are conducted, the combined memory can quickly reach GB proportions for each process. To give you an idea, if we consider N nodes each with C cores, each MPI process needs (N-1) x C connections to be able to communicate with the MPI processes on the other nodes. To overcome this problem, in view of the steady increase in the number of cores per socket, Mellanox developed the eXtended Reliable Connection for its ConnectX cards. XRC is therefore available on these cards and their successors, provided the OFED (OpenFabrics Enterprise Distribution) library is used in version 1.3 or later.
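To put the (N-1) x C formula into perspective, here is a back-of-the-envelope estimate in C. The figure of roughly 3 KB per RC connection is an assumption used only for illustration; the actual footprint depends on the queue-pair parameters and buffer sizes chosen by the MPI library.

/* Rough estimate of RC connection memory per MPI process, following the
 * (N-1) x C formula above. Node count, core count and the per-connection
 * cost are illustrative assumptions, not measured values. */
#include <stdio.h>

int main(void)
{
    const int N = 1024;               /* nodes (assumed) */
    const int C = 16;                 /* MPI processes per node (assumed) */
    const double kb_per_conn = 3.0;   /* assumed memory per RC connection, in KB */

    long conns = (long)(N - 1) * C;   /* connections needed by one process */
    double mb  = conns * kb_per_conn / 1024.0;

    printf("connections per process      : %ld\n", conns);
    printf("estimated RC memory / process: %.1f MB\n", mb);
    printf("estimated RC memory / node   : %.1f MB\n", mb * C);
    return 0;
}

With these assumed numbers, one process already needs over 16,000 connections and a few tens of MB; multiply the node count by ten and the per-node total moves into the GB range evoked above.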
