After a first release that proved essential for moving parallel computations to accelerators in a standardized way, OpenACC 2.0 was long overdue. Now that it’s here, this series of articles aims to help you make the most of its new capabilities. Last month, we focused on the new data management features. This month, let’s take a practical look at new ways to improve the management of compute regions.
Stéphane Chauveau, PhD.
The OpenACC 2.0 specification features a new section about atomic directives. Developers familiar with OpenMP may have a sense of déjà vu while reading it: OpenACC atomics management is almost an exact copy of the OpenMP specification, with only a few minor changes such as replacing the OMP directive sentinel with the ACC sentinel. This is of course entirely intentional. There was no practical reason to come up with a new syntax for a feature with almost identical requirements. As a consequence, developers will be able to easily migrate codes using atomics from one standard to the other.
The problem of atomics is simple to understand but can lead to very annoying bugs. Imagine an OpenMP or OpenACC application with at least two threads (or gangs or workers) that may both attempt to increment a single global counter X. In practice, this task is broken down into three distinct actions: read X from memory into a register, increment the register, and write the register back to X. Within a given thread, the order of these actions is perfectly defined, but across multiple threads it’s a completely different story. The pseudo-code in listing 1 illustrates a possible behavior where two threads each increment X, yet the final value of X has only been incremented once.
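The interleaving described above can be replayed deterministically in plain C. This is only a sketch and not the article's own listing: no real threads are involved, and the function name `lost_update_demo` and the local variables standing in for each thread's register are hypothetical.

```c
/* Sequential replay of one unlucky interleaving of two threads, A and B,
 * that both execute X = X + 1 without any atomicity guarantee.
 * reg_a and reg_b stand in for the private registers of each thread. */
int lost_update_demo(int x)
{
    int reg_a, reg_b;

    reg_a = x;          /* A: read X into a register                    */
    reg_b = x;          /* B: read X before A has written its result    */
    reg_a = reg_a + 1;  /* A: increment its register                    */
    reg_b = reg_b + 1;  /* B: increment its register                    */
    x = reg_a;          /* A: write back, X is now the initial value + 1 */
    x = reg_b;          /* B: write back the same value, A's update is lost */

    return x;           /* initial value + 1, although two increments ran */
}
```

Starting from X = 0, the replay ends with X = 1 even though both "threads" performed an increment.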
The situation is even more chaotic in real life, as the memory caches found on all modern GPUs and CPUs add another level of complexity. Fortunately, all modern accelerators also provide mechanisms to implement these actions in a way that appears atomic with respect to memory accesses. In OpenACC 2.0, incrementing a scalar variable atomically is as simple as inserting an atomic update pragma before the increment statement, as illustrated in listing 2. Most of the basic arithmetic operations on the native scalar types of each language are officially supported, but some of them may not be available on all targets, since implementing proper atomic operations usually requires some kind of hardware support.
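As an illustration of the atomic update directive, here is a minimal sketch that counts the negative entries of an array. The function name `count_negatives` and the data clauses are assumptions made for this example, not taken from the article's listing. A compiler without OpenACC support ignores the pragmas and runs the loop sequentially, which gives the same result.

```c
/* Counts the negative entries of v[0..n-1].
 * The shared counter is updated under "atomic update" so that concurrent
 * increments from different gangs or workers are never lost. */
int count_negatives(const float *v, int n)
{
    int count = 0;

    /* copy(count) makes the scalar shared on the device and copies the
     * final value back to the host at the end of the region. */
    #pragma acc parallel loop copyin(v[0:n]) copy(count)
    for (int i = 0; i < n; i++) {
        if (v[i] < 0.0f) {
            #pragma acc atomic update
            count++;
        }
    }
    return count;
}
```

Note that the loop body never reads `count`; it only updates it, which is exactly the situation where atomic update is sufficient.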
The atomic update directive is well suited to counting, for instance, the number of errors or matches over the entire execution of an OpenACC compute construct. However, it should not be used when the thread performing the update needs to know the updated value. To understand why, consider a large sparse vector and a function that compacts its non-zero values into a vector of at most 1,000 elements. The naive approach shown in listing 3 is incorrect because every occurrence of the n variable outside the atomic operation is subject to race conditions with other threads. In other words, each of these reads of n may return a different value within the same loop iteration.
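To make the race concrete, the sketch below replays two threads sequentially. It assumes the naive loop body has the shape "atomically increment n, then store into dst[n - 1]"; that shape, like the function name `replay_naive_pack`, is an assumption for illustration and may differ from the article's listing.

```c
/* Sequential replay of two threads, A and B, that each execute:
 *     #pragma acc atomic update
 *     n++;
 *     dst[n - 1] = value;    (n is re-read here, outside the atomic)
 * Returns the number of distinct slots of dst that were written. */
int replay_naive_pack(float dst[2], float value_a, float value_b)
{
    int n = 0;

    n++;                     /* A: atomic increment, n == 1            */
    n++;                     /* B: atomic increment, n == 2            */
    int slot_a = n - 1;      /* A: re-reads n and sees 2 instead of 1  */
    int slot_b = n - 1;      /* B: re-reads n and sees 2               */
    dst[slot_a] = value_a;   /* A writes dst[1]                        */
    dst[slot_b] = value_b;   /* B overwrites dst[1]; dst[0] stays unset */

    return (slot_a == slot_b) ? 1 : 2;
}
```

Both increments are atomic and the counter is correct, yet one value is lost because each thread learned its slot from a later, unprotected read of n.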
This issue is solved by the atomic capture variant, which performs an atomic update of a variable and also copies either the original or the resulting value into another variable. The proper version of our sparse vector packing function, using an atomic capture, is given in listing 4.
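A packing function built on atomic capture might look like the following sketch. The names `pack_nonzero`, `MAX_PACKED` and `slot`, as well as the data clauses, are assumptions made for this example. Each iteration captures the old value of the counter in a private variable and uses only that private copy afterwards.

```c
#define MAX_PACKED 1000  /* capacity of the packed vector */

/* Packs the non-zero entries of src[0..len-1] into dst, which must hold
 * at least MAX_PACKED elements. Returns the number of entries stored.
 * Non-zero entries beyond the capacity are dropped. When run in parallel,
 * the order of the packed values is not specified. */
int pack_nonzero(const float *src, int len, float *dst)
{
    int n = 0;

    #pragma acc parallel loop copyin(src[0:len]) copy(dst[0:MAX_PACKED]) copy(n)
    for (int i = 0; i < len; i++) {
        if (src[i] != 0.0f) {
            int slot;  /* private to this iteration */

            /* Atomically read the old value of n into slot and increment n. */
            #pragma acc atomic capture
            slot = n++;

            /* Only the private copy is used from here on. */
            if (slot < MAX_PACKED)
                dst[slot] = src[i];
        }
    }
    return n < MAX_PACKED ? n : MAX_PACKED;
}
```

Because every thread receives a distinct value in `slot`, no two threads can write to the same element of `dst`.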
Efficiently targeting device types
A few years ago, when the OpenACC working group started to develop the specification, the promise was to write once and execute efficiently everywhere. While the portability issue is more or less solved today, efficient execution on multiple platforms from the same sources is often just not possible because of the architectural differences between the accelerators currently found on the market. In practice, the number of gangs, the number of workers and the vector length of a parallel construct usually need to be manually fine-tuned for each target accelerator in order to get consistent performance. The impact of using an improper value is often dramatic, and that is especially true in environments using both Xeon Phis and GPUs from NVIDIA or AMD. Compilers will of course attempt to implement heuristics so as to choose the best values according to the target, but the result will rarely be optimal.
A partial solution to this problem is the new device_type or dtype clause, which provides a mechanism to restrict some clauses to a specific target. The device_type clause takes a keyword that names the target; its role is to make the clauses that follow it specific to that target. The effect of device_type stops when another device_type clause is found or at the end of the directive. Note also that the special target name * acts as a default that is used when no other device_type clause is matched. Simply speaking, device_type is like a C switch that would be applied to clauses within a directive. Listing 5 shows how to use device_type to customize the number of gangs and workers in a loop.
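The switch-like behavior can be sketched on a simple scaling loop. The function name `scale` is hypothetical, the gang and worker counts are arbitrary placeholders rather than tuned values, and NVIDIA2 stands for a user-defined keyword that a given compiler may or may not know.

```c
/* Multiplies x[0..n-1] by factor, with per-target launch parameters.
 * Each device_type clause applies to the clauses that follow it, up to the
 * next device_type clause or the end of the directive. */
void scale(float *x, int n, float factor)
{
    #pragma acc parallel loop copy(x[0:n]) \
        device_type(NVIDIA)  num_gangs(256) num_workers(8)  \
        device_type(NVIDIA2) num_gangs(512) num_workers(16) \
        device_type(RADEON)  num_gangs(128) num_workers(4)  \
        device_type(XEONPHI) num_gangs(60)  num_workers(4)  \
        device_type(*)       num_gangs(64)  num_workers(2)
    for (int i = 0; i < n; i++)
        x[i] *= factor;
}
```

The copy clause appears before the first device_type clause, so it applies to every target; only the gang and worker counts vary.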
The OpenACC specification does not enforce specific names for the different target architectures available today. The NVIDIA, RADEON and XEONPHI “keywords” used in listing 5 are the recommended names for the three major kinds of accelerators, but compilers are supposed to provide mechanisms to create other keywords (typically through a command-line option). In listing 5, it is also assumed that NVIDIA2 is a user-defined keyword for a specific NVIDIA GPU. Unknown target keywords are simply ignored by the OpenACC compiler.
The device_type clause is only allowed on a few directives, namely routine, kernels, parallel, loop, update and their combined forms. Also be aware that a number of clauses are not allowed after a device_type. A quick look at the documentation of each directive will give you a detailed list of the possibilities you have.
© HPC Today 2020 - All rights reserved.