Should priority be given to eXtended Reliable Connection (XRC)?
With the standard InfiniBand transports, no distinction is made between a connection to a remote core and a connection to a remote node: each pair of MPI processes requires its own connection. The idea behind XRC is therefore to allow a single connection between a process and an entire node, which reduces the number of connections required by each MPI process from the total number of processes to the total number of nodes. Thus, with XRC, each MPI process needs only N-1 connections, where N is the number of nodes, to communicate with all other MPI processes.
Figure 1 illustrates this difference. Note that XRC can be used with most MPI libraries, including Open MPI, Mvapich2 and Intel MPI. In the latter case, however, it works only with the OpenFabrics Alliance stack and not with DAPL.
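Enabling XRC requires no change to the application itself; it is selected when the job is launched. As a purely illustrative sketch, the launch lines might look like the ones below. The MVAPICH2 and Open MPI parameters are documented for the library versions of that generation, while the Intel MPI variable name, the receive-queue values, the process counts and the file names are assumptions to be checked against your installation.

  # MVAPICH2: enable XRC explicitly
  mpirun_rsh -np 1024 -hostfile hosts MV2_USE_XRC=1 ./app

  # Open MPI (openib BTL): XRC is selected by using "X"-type entries
  # in the receive-queue specification (the values here are arbitrary)
  mpirun -np 1024 --mca btl openib,sm,self \
         --mca btl_openib_receive_queues X,4096,1024:X,65536,512 ./app

  # Intel MPI: XRC is only usable with the OFA fabric, not with DAPL
  mpirun -np 1024 -genv I_MPI_FABRICS shm:ofa -genv I_MPI_OFA_USE_XRC 1 ./app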
Performance-wise, however, the advantages are not so clear-cut. Although some applications do show gains, thanks to the memory freed up, we unfortunately cannot quantify them precisely. We ran a number of tests based on Mvapich2 1.9, but it turns out that XRC has a bug (at least on the Stampede cluster) that makes the results unusable. Writing this article will at least have enabled us to detect it and report it to the development team. Moreover, experience teaches us that XRC becomes costly in very large applications involving many cores or nodes. In light of this, and with some reservations, it may be better to use the Unreliable Datagram layer, as we shall now see.
Unreliable Datagram (UD)
Unlike the RC and XRC layers, UD is a connectionless transport, which reduces the memory demand of MPI processes. The downside is that UD offers far fewer features: RDMA is not supported, for example, and message size is limited to the MTU. The MPI library therefore has to work harder, in particular to handle the segmentation, sequencing and retransmission of messages. In practice, this performance cost is not negligible for applications that run on a small number of nodes and exchange large messages.
On the other hand, for codes that mainly use short messages, the amount of memory required by each MPI process remains almost constant as the system grows. And since no time is lost establishing connections, UD performs better when a large number of nodes is involved.
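Here again, the transport is chosen at launch time rather than in the code. As a hedged sketch for MVAPICH2 built with its hybrid UD-RC/XRC channel (the MV2_* variables are documented, but the process counts, the threshold value and the file names are illustrative assumptions):

  # Force all traffic over UD, whatever the message size
  mpirun_rsh -np 4096 -hostfile hosts MV2_USE_ONLY_UD=1 ./app

  # Or let the library combine UD with RC/XRC connections once the job
  # exceeds a given number of processes
  mpirun_rsh -np 4096 -hostfile hosts MV2_USE_UD_HYBRID=1 \
             MV2_HYBRID_ENABLE_THRESHOLD=1024 ./app

The hybrid mode is often the pragmatic middle ground: it keeps the memory footprint close to that of UD while still allowing connected transports where they pay off.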