An Introduction to Performance Programming (part II)

By Romain Dolbeau | March 12, 2014

While it is always difficult to optimize existing codes, an appropriate methodology can provide provable and replicable results. Last month, we saw that choosing the right test cases was essential to validate and evaluate optimizations; we then concentrated on using the compiler efficiently to optimize and vectorize. This month, let’s see how data layouts, algorithm implementation and parallelization impact performance…

Romain Dolbeau – HPC Fellow, CAPS Entreprise.

As previously mentioned, a minimal understanding of the hardware is required to exploit it. Efficiently utilizing the computational resources of the CPU is important, but since the term “memory wall” was coined by Wulf and McKee [1], it has been known that memory accesses are the limiting factor for many kinds of codes. The expression “bandwidth-bound algorithm” is dreaded by many of those tasked with improving performance, as it means that the available bandwidth from the main memory to the CPU cores is the limiting factor. But even without changing the algorithm, it is sometimes possible to drastically improve performance.

I. DATA LAYOUTS

One of the main issues with memory bandwidth is the amount wasted by poor data layouts. This issue is not new, and was already targeted by papers such as the work of Truong et al. [2]. But it still seems to be largely unknown to most developers. We first have to go back to computer architecture (and mention again Hennessy & Patterson [3]). The main technique a CPU uses to avoid waiting the hundreds of cycles required for data to arrive from main memory is caching: small but fast memories sit close to the execution cores. While their existence is known, it seems that most programmers are unaware of their behavior and write code that greatly impedes their efficiency.

I.A – Arrays of structures vs. structures of arrays – Caches do not retain data at single element granularity, e.g. a single double-precision value. While this would help with temporal locality (the fact that a recently used element might soon be reused), it would do nothing for spatial locality (the fact that it’s likely the next useful element will be a memory neighbor to a recently used element). To gain such spatial locality, caches have a granularity of a cache line, a contiguous amount of memory. Common sizes are 32 or 64 bytes of well-aligned memory (i.e. the address of the first byte is an integer multiple of the cache line size). Whenever a code requires an element from memory, the entire cache line is loaded from the main memory and retained in the cache. If a neighboring element from the same cache line is subsequently needed, the access will be a cache hit, and therefore very fast. This is a well-known mechanism, taught in most computer science classes.

But the implications are not always well understood. If a cache line is 64 bytes, then every memory transaction will involve the whole 64 bytes, no matter how few bytes are actually needed by the code. If only a single double-precision element (8 bytes) is needed from each cache line, 87.5% of the memory bandwidth is wasted on unused bytes. Therefore, it is very important to ensure that as many elements as possible in each loaded cache line are actually used. Unfortunately, some extremely common programming techniques go against that principle. The most obvious offenders are structures (and of course objects, which are generally structures with associated functions).
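The arithmetic behind that figure is simple enough to write down as a small C helper. This is a minimal sketch; the function name `wasted_fraction` and its parameters are hypothetical and chosen only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Fraction of the transferred bytes that the code never uses.
 *   lines_loaded: number of cache lines brought in from main memory
 *   line_bytes:   cache line size in bytes (64 in the example above)
 *   useful_bytes: bytes the code actually needs from those lines
 * (Hypothetical helper, named for this sketch.) */
double wasted_fraction(size_t lines_loaded, size_t line_bytes,
                       size_t useful_bytes)
{
    size_t transferred = lines_loaded * line_bytes;
    assert(transferred > 0 && useful_bytes <= transferred);
    return (double)(transferred - useful_bytes) / (double)transferred;
}
```

Calling `wasted_fraction(1, 64, sizeof(double))` on a platform with 8-byte doubles gives 56/64, i.e. the 87.5% quoted above.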

While there is a lot to be said in favor of structures, that is not our subject; therefore we’ll look at the downside, the way they can sometimes waste memory bandwidth. The content of a well-defined structure is a lot of information related to an abstract concept in the code – it could be a particle used in fluid simulation, the current state of a point in a discretized space, or even an entire car. All the occurrences of that concept will be allocated in what is described as an Array of Structures. Subsequently, functions in the code will go through all occurrences in a loop to process the data, such as this:

foreach particle in charged particles
    update_velocity(particle, electric field)

Presumably, the function update_velocity will change the velocity of the charged particle in the electric field. But while the velocity might be defined with only a few values (e.g. the current velocity in X, Y and Z), the structure itself is likely to contain much more information that’s irrelevant to that particular step of the computation. Because structures are contiguous in memory, much of that information will be loaded anyway, as it belongs to the same cache line. Let’s assume each particle in our example is defined by 16 double-precision values, i.e. 128 bytes. The structure would occupy 2 full cache lines all by itself. Let’s also assume all 3 velocities are in the same half of the structure, i.e. in the same cache line. Each time we need to update a velocity, the CPU will:

1) Load the cache line (64 bytes) containing the three velocities (24 bytes);

2) Perform any required computations, and update the value in the cache;

3) Eventually, when space in the cache is needed, flush the whole cache line (64 bytes) back to memory.

Only 24 of the 64 bytes transferred in each direction are useful, so 62.5% of the memory bandwidth required to load and store the particle is wasted on unnecessary data. If the three velocities spanned both halves of the structure, the bandwidth requirement would double for the same amount of useful data, wasting 81.25%.
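The array-of-structures case just described can be sketched in C as follows. Everything here is an assumption made for illustration: the field names, the `other` padding standing in for unrelated per-particle data, the parameter `qm_dt` (charge-to-mass ratio times the time step), and the simplification that the field is uniform. The helper `lines_touched` also assumes each structure starts on a cache line boundary.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64  /* assumed cache line size in bytes */

/* Hypothetical particle: 16 doubles = 128 bytes, i.e. two cache lines.
 * Only vx, vy and vz matter to the velocity update. */
struct particle {
    double x, y, z;     /* position */
    double vx, vy, vz;  /* velocity: the 24 useful bytes */
    double other[10];   /* stand-in for data irrelevant to this step */
};

/* Number of cache lines covered by bytes [first, last] of a structure
 * that starts on a cache line boundary. */
size_t lines_touched(size_t first, size_t last)
{
    assert(first <= last);
    return last / CACHE_LINE - first / CACHE_LINE + 1;
}

/* Array-of-structures update: each iteration loads the full 64-byte
 * line holding the velocities, although only 24 bytes are used. */
void update_velocity_aos(struct particle *p, size_t n,
                         double ex, double ey, double ez, double qm_dt)
{
    for (size_t i = 0; i < n; i++) {
        p[i].vx += qm_dt * ex;
        p[i].vy += qm_dt * ey;
        p[i].vz += qm_dt * ez;
    }
}
```

With this field order the velocities occupy bytes 24 to 47, so they sit in a single line. If they occupied bytes 48 to 71 instead, `lines_touched(48, 71)` would report two lines, which is the 81.25% case.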

One solution for this issue is called structures of arrays. As the name implies, the idea is to invert the relationship by first allocating all atoms of data in arrays, and then grouping them into a structure. Instead of having an array of particles, each with its velocities, the code would use three arrays of velocities. Each array would contain a single velocity component (X, Y or Z) for all particles. The result for the function update_velocity above is that we would need three cache lines instead of one: one for each of the X, Y and Z velocities. But that is not a bad thing. The first particle would require 192 bytes of bandwidth, loading 8 elements of each array. However, the next 7 particles would require no bandwidth at all: their data has already been brought into the cache by the first particle’s accesses, thanks to spatial locality. The average bandwidth is therefore an optimal 24 bytes per particle, with a greatly reduced latency because of locality. If that function were purely bandwidth-bound, we would have gained a factor of about 2.67 (64 bytes per particle down to 24) by reorganizing data in a cache-friendly manner.
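A minimal C sketch of the structure-of-arrays layout described above. The type `particles_soa`, the function `update_velocity_soa` and the parameter `qm_dt` (charge-to-mass ratio times the time step) are hypothetical names chosen for this illustration, and the electric field is assumed uniform to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

/* Structure of arrays: one contiguous array per field. Only the
 * velocity arrays are shown; other fields would get their own arrays
 * and would not be touched by this function at all. */
struct particles_soa {
    size_t  n;   /* number of particles */
    double *vx;  /* vx[i] is the X velocity of particle i */
    double *vy;
    double *vz;
};

/* Every byte of every cache line loaded here is useful: consecutive
 * iterations read consecutive doubles of the same three arrays. */
void update_velocity_soa(struct particles_soa *p,
                         double ex, double ey, double ez, double qm_dt)
{
    assert(p != NULL);
    for (size_t i = 0; i < p->n; i++) {
        p->vx[i] += qm_dt * ex;
        p->vy[i] += qm_dt * ey;
        p->vz[i] += qm_dt * ez;
    }
}
```

With 64-byte lines, each line holds 8 consecutive doubles, so 8 particles cost 3 × 64 = 192 bytes, or 24 bytes per particle. A side benefit of this layout is that unit-stride loops of this kind are usually easier for a compiler to vectorize.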

I.B – Multi-dimensional array vs. array of pointers – This is an issue that doesn’t exist in the Fortran language, where multi-dimensional arrays are the norm. But in the C language family the issue is pervasive. A lot of code utilizes not multi-dimensional arrays but arrays of pointers, adding a useless intermediate load. Listing 1 illustrates this form of inefficient code. The data is stored behind a pointer to a pointer (hence the two stars). The body of the code looks good to the untrained eye: it has a nice pair of square brackets to access the element of the two-dimensional data, as is done in Fortran. But the performance will be suboptimal. The real meaning of the code includes not one, but two chained loads to access the data. The machine must first evaluate A[i], itself a pointer to double. Then this pointer is used to retrieve A[i][j], the data itself. This is effectively an indirect access.
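The listing referred to above is not reproduced in this text. The following C sketch shows the pattern being described; it is an illustration, not the author's original code, and the names `alloc_rows`, `free_rows` and `sum_ptrptr` are invented for the example.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate an n x n matrix as an array of row pointers: one allocation
 * for the pointers, then one more per row. Rows are zero-initialized
 * and may end up anywhere in memory. */
double **alloc_rows(size_t n)
{
    double **A = malloc(n * sizeof *A);
    if (A == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++) {
        A[i] = calloc(n, sizeof **A);
        if (A[i] == NULL) {          /* undo the partial allocation */
            while (i > 0)
                free(A[--i]);
            free(A);
            return NULL;
        }
    }
    return A;
}

void free_rows(double **A, size_t n)
{
    if (A == NULL)
        return;
    for (size_t i = 0; i < n; i++)
        free(A[i]);
    free(A);
}

/* A[i][j] on a double** means two dependent loads: first the row
 * pointer A[i], then the element it points to. */
double sum_ptrptr(double **A, size_t n)
{
    assert(A != NULL);
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            s += A[i][j];
    return s;
}
```

A compiler can often hoist the load of `A[i]` out of the inner loop in a case this simple, but it cannot do so in general (for example when the body might modify the row pointers), and the rows remain scattered across separate allocations.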

An efficient version can look exactly the same in the body, by utilizing a properly typed pointer to the data. This is illustrated in listing 2. By specifying the dimensions of the array in the parameter list, it becomes possible to use the clean multi-dimensional notation of the C language, which unfortunately looks exactly the same as a pointer-to-pointer double dereference. This new version will simply compute the linear index i * n + j and do a single lookup in memory to access the data. Of course, the data allocation is also different: instead of first allocating the array of pointers, followed by a loop allocating the array of doubles behind each pointer, a single allocation of the entire data set is used. This is the exact same allocation that would be used for the ugly, explicitly linearized version illustrated in listing 3, which most programmers justifiably try to avoid.
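The listings mentioned above are not reproduced in this text. As a hedged illustration of the two alternatives, the C sketch below uses a C99 variable-length array parameter for the typed version and a plain pointer for the linearized one. The names `alloc_matrix`, `sum_2d` and `sum_linear` are invented for the example, and variably modified types are assumed to be available (they are in GCC and Clang, but they are an optional feature in C11).

```c
#include <assert.h>
#include <stdlib.h>

/* Single zero-initialized allocation holding all n * n elements. */
void *alloc_matrix(size_t n)
{
    return calloc(n * n, sizeof(double));
}

/* Properly typed parameter: A is a pointer to rows of n doubles, so
 * A[i][j] is one load at linear offset i * n + j, with no intermediate
 * pointer to fetch. */
double sum_2d(size_t n, double A[n][n])
{
    assert(A != NULL);
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            s += A[i][j];
    return s;
}

/* Explicitly linearized equivalent: same memory layout and same single
 * load, but the index arithmetic is written out by hand. */
double sum_linear(size_t n, const double *A)
{
    assert(A != NULL);
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            s += A[i * n + j];
    return s;
}
```

A caller can declare `double (*A)[n] = alloc_matrix(n);`, fill it with `A[i][j] = ...`, pass it as `sum_2d(n, A)`, and release it with a single `free(A)`.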

[References]

[1] W. A. Wulf and S. A. McKee, Hitting the memory wall: implications of the obvious, SIGARCH Comput. Archit. News, vol. 23, no. 1, pp. 20–24, Mar. 1995.

[2] D. Truong, F. Bodin, and A. Seznec, Improving cache behavior of dynamically allocated data structures, in Parallel Architectures and Compilation Techniques, 1998. Proceedings. 1998 International Conference on, 1998, pp. 322–329.

[3] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach. San Mateo, CA: Morgan Kaufmann, 1990.
