An Introduction to Performance Programming (part II)

By Romain Dolbeau | March 12, 2014

One of the consequences is that in a two-socket system (two NUMA nodes), only half the memory will be used by the single-thread process (unless consuming a large amount of memory). It also means that only half the memory bandwidth from the entire system can be exploited, leaving the second memory controller in the second socket unused. Another consequence is that if the code subsequently becomes multi-threaded (for instance through the use of OpenMP directives), then half the threads will exclusively use remote memory, and performance will suffer accordingly.

Depending on the kind of workload, there are some simple heuristics to limit the nefarious effects of NUMA:

– Single process, single thread: the default behavior will only use the local NUMA node as explained. This might be a good thing (it optimizes the latency) or a bad thing (it cuts the available memory bandwidth in half). Some code might benefit from interleaving pages on all nodes, either by prefixing the command with numactl –interleave=all, or by replacing the allocation function such as malloc by a NUMA-specific function such as numa_alloc_interleaved.

– Multiple processes, each single thread: this is the norm for MPI-based code that does not include multithreading. The default behavior of Linux (provided the process have been properly “pinned” to their CPU) is excellent, as all processes will have low latency memory, yet the multiplicity of processes will ensure the use of all memory controllers and available bandwidth.

– Single process, multiple threads: this is the norm for OpenMP-based code. The default behavior is usually quite bad; more often than not, the code will cause the allocation of physical pages on the master thread’s node, despite parallel sections running on all CPUs. Reading data from a file, receiving data from the network, explicit non-parallel zeroing or initialization of the allocated memory will all lead to this single-node allocation. Forcing interleaving as above usually helps (by utilizing all available bandwidth), but is not optimal (it forces the averaging of latency rather than minimizing it); it still should be tried first, as it is very easy to implement. The ideal solution is to be able to place each page on the node whose threads will make the most use of it, but that is not a simple task.

– Multiple processes, with multiple threads: although it might seem the most complicated case, it usually isn’t: the typical placement of one process in each socket, with as many threads as there are cores in the socket leads to an optimal placement by the default behavior. No matter which thread allocates the memory, all the threads of the same process will see optimal latency.

III.C – False sharing – False sharing is a long-standing problem in a cache coherent multi-core machine (it is described as “widely believed to be a serious problem” by Bolosky and Scott [15]). Modern CPUs generally use invalidation-based coherency protocol (relevant details can be found in Culler et al. [16]), whereas every write to a memory address requires the corresponding cache line to be present in the local CPU cache with exclusive write access- that is, no other cache holds a valid copy.

“True sharing” occurs when a single data element is used simultaneously by more than one CPU. It requires careful synchronization to ensure respect of the semantic, and incurs a performance penalty as each update requires the invalidation of the other CPU’s copy. Back-and-forth can become very costly but is unavoidable if required by the algorithm (unless, of course, one changes the algorithm).

“False sharing” occurs when two different data elements are used by two different CPUs, but happen to reside in the same cache line. There is no need for synchronization. However, because the granularity of invalidation is the whole cache line, each data update by a CPU will invalidate the other CPU’s copy. This requires costly back-and-forth of the cache line. This can greatly affect performance and is completely unnecessary: if the two data elements lived in different cache lines, no invalidation would occur for either, thus ensuring maximum performance.

For very small data structures (such as a per-thread counter), this can be a huge problem as each update will lead to an unnecessary invalidation. For instance, listing 4 describes a structure of two elements: neg and pos. Without the intermediate padding array pad, those two values would share a cache line, and when used by two different threads, would cause a major performance issue. The padding ensures that they are in different cache lines, thus avoiding the problem. A synthetic benchmark using this structure running two threads on two different sockets in a dual X5680 system would see a increase in time of 50% as opposed to when the padding was not in use. Hardware counters show a very large number of cache line replacement in the modified state when padding is not used, versus a negligible number when it is (cache coherency and the modified state are well explained by Culler et al. [16]).

For large arrays that are updated in parallel by multiple threads, the data distribution among threads will be an important factor. If using a round-robin distribution (i.e. thread number n out of N total threads updates all elements of indices m such as n ≡ m (mod N)), then false sharing will occur on most updates – not good. But if a block distribution was used, where each thread accesses 1/N of the elements in a single continuous block, then false sharing only occurs on cache lines at the boundaries of each block – at most one cache line will have shared data between two threads. Not only will the number of unnecessary invalidations be small relative to the total number of accesses, but they will usually be masked by the fact that one thread will update at the beginning of the computation, and the other one at the end, thus minimizing the effect. And to ensure conflict-free accesses, the block size that each thread handles can be rounded to a integer multiple of the cache line size, at the cost of a slightly less efficient load balancing between threads.

CONCLUSION

Here is the end of this series of two articles that introduces a few key points for the newcomer to keep in mind when trying to improve the performance of code. To summarize, it is important to be able to justify the validity of the work done, in terms of reliability, speed and the choice made when improving the code. Other points are to properly exploit the tools at hand before modifying the code, to understand the relation between the data structures and the performance of the code, avoid re-inventing the wheel and exploit pre-existing high-performance implementations of algorithms, and finally run parallel code in a manner adapted to the underlying hardware.

Hopefully, this will help programmers and scientists alike to understand some key aspects of performance and obtain higher performing code without an excessive amount of work. Happy (performance) programming!

[References]

[15] W. J. Bolosky and M. L. Scott, False sharing and its effect on shared memory performance, in USENIX Systems on USENIX Experiences with Distributed and Multiprocessor Systems – Volume 4, ser. Sedms’93. Berkeley, CA, USA: USENIX Association, 1993, pp. 3–3.

[16] D. E. Culler, A. Gupta, and J. P. Singh, Parallel Computer Architecture: A Hardware/Software Approach, 1st ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1997.

<1 2 3 >

Navigation