Discovering OpenACC 2.0: the new compute optimization features
February 11, 2014

User-defined calls

In version 1.1, OpenACC was not very explicit about function calls. Calling intrinsic mathematical routines was inherently possible but calling user-defined functions was not really part of the specification. This led compiler vendors to accept calls to user-defined functions as an extension to the standard. OpenACC 2.0’s routine directive is an attempt at standardizing calls to user-defined functions on the device. At first glance, the problem may seem simple, but considering the three levels of parallelism in the OpenACC execution model, it is not. For instance, a function that contains a gang loop should never be called from inside another gang loop, and similar issues exist at the worker and vector levels.

Accordingly, the new routine directive must be specified for each user-defined function called from within an accelerated region. In C and C++, the directive will typically appear before the function; in Fortran, it will be placed inside the specification part of the function or subroutine.

The routine directive must also carry a gang, worker, vector, or seq clause to specify the context in which the routine may be called and the maximum worksharing level allowed in its body. For example, a routine with a worker clause can contain worker, vector, and seq loops but no gang loops; similarly, it may not be called from within the body of a worker or vector loop. The proper syntax is illustrated in listing 6. Routines that do not contain any kind of loop worksharing must have a seq clause, indicating that they can be called from anywhere.
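To give an idea of the syntax, here is a minimal C sketch; the function and variable names are ours and purely illustrative, not those of listing 6:

    /* A worker routine: it may contain worker and vector loops,
       but must never be called from inside a worker or vector loop. */
    #pragma acc routine worker
    void scale_row(float *restrict row, float alpha, int n)
    {
        #pragma acc loop worker vector
        for (int i = 0; i < n; ++i)
            row[i] *= alpha;
    }

    /* No loop worksharing at all: declared seq, callable from any level. */
    #pragma acc routine seq
    float square(float x)
    {
        return x * x;
    }

    void scale_matrix(float *restrict a, float alpha, int rows, int cols)
    {
        #pragma acc parallel loop gang copy(a[0:rows*cols])
        for (int r = 0; r < rows; ++r)
            scale_row(&a[r * cols], square(alpha), cols); /* worker routine called at gang level */
    }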

Calling CUDA or OpenCL functions

Native functions written in CUDA or OpenCL haven't been forgotten. To call one, a bind clause must be added to the routine directive of the function declaration, giving the name of the corresponding native function. The device_type mechanism can then be used to select the right native function name for each target, as illustrated in listing 7 with pseudo CUDA and RADEON versions of our sum function. Note that OpenACC 2.0 does not specify an exact mechanism for integrating external native code, so it may vary from one compiler to another.
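To give a feel for the syntax, here is a hedged sketch of such a declaration; the native function names (cuda_sum, ocl_sum) are placeholders, and the device type identifiers are implementation-defined:

    /* Map calls to sum() inside accelerated regions onto native
       implementations, one per device type. */
    #pragma acc routine seq \
        device_type(nvidia) bind("cuda_sum") \
        device_type(radeon) bind("ocl_sum")
    float sum(float a, float b);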

An alternative form of the bind clause can be used to generate customized versions of a routine for different accelerator targets. The bind argument is then not a string anymore; instead, it is the name of an alternative C, C++, or Fortran routine that will itself be compiled in OpenACC mode. Any call to the original function inside an accelerated region is then replaced by a call to the specified alternative function. As with native routines, the proper bind clause can be selected using the device_type mechanism, as illustrated in listing 8.
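Schematically, and again with purely illustrative names:

    /* Generic version, used on the host and as a fallback. */
    #pragma acc routine seq \
        device_type(nvidia) bind(sum_nvidia) \
        device_type(radeon) bind(sum_radeon)
    float sum(float a, float b);

    /* Target-specific alternatives, themselves compiled in OpenACC mode;
       calls to sum() in accelerated regions are redirected to them. */
    #pragma acc routine seq
    float sum_nvidia(float a, float b) { return a + b; /* NVIDIA-tuned variant */ }

    #pragma acc routine seq
    float sum_radeon(float a, float b) { return a + b; /* RADEON-tuned variant */ }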

When tiling becomes standard

The OpenACC committee has also been working on high-level loop tuning features that may improve code efficiency in some circumstances. In OpenACC 2.0, that work resulted in the tile clause of the loop directive, which decomposes a loop nest into tiles of a fixed size. This is a classical code transformation whose purpose is to improve the locality of memory accesses inside each tile and thus make better use of memory caches.
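In its simplest form, the clause might be used as in the following sketch, where the 32x32 tile size, the loop bounds and the array names are purely illustrative:

    /* Tile a 2D stencil nest into 32x32 blocks. */
    void smooth(int n, int m, const float in[n][m], float out[n][m])
    {
        #pragma acc parallel loop tile(32, 32) \
            copyin(in[0:n][0:m]) copyout(out[0:n][0:m])
        for (int i = 1; i < n - 1; ++i)
            for (int j = 1; j < m - 1; ++j)
                out[i][j] = 0.25f * (in[i-1][j] + in[i+1][j]
                                   + in[i][j-1] + in[i][j+1]);
    }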

Listing 9 shows a typical use case of the tile clause: the two-dimensional convolution, where the computation performed at each location accesses several neighboring locations in both directions. The effect of the tile clause is further illustrated in listing 10, where the code from listing 9 is translated into a "no-tile" version. Each of the loops affected by the tiling is broken down into two loops: an outer loop that iterates over the tiles and an inner loop that iterates inside each tile. The outer and inner loops are then collapsed using collapse clauses before applying the gang clause to the outer loops and the worker clause to the inner loops. As a result, cache re-use between workers of the same gang should be improved.
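Applied to the sketch above, the transformation looks roughly like this (assuming, for simplicity, iteration counts that are exact multiples of the tile size):

    /* Hand-written equivalent of the tiled nest above. */
    void smooth_no_tile(int n, int m, const float in[n][m], float out[n][m])
    {
        #pragma acc parallel copyin(in[0:n][0:m]) copyout(out[0:n][0:m])
        {
            #pragma acc loop gang collapse(2)
            for (int it = 1; it < n - 1; it += 32)       /* iterate over the tiles */
                for (int jt = 1; jt < m - 1; jt += 32)
                {
                    #pragma acc loop worker collapse(2)
                    for (int i = it; i < it + 32; ++i)   /* iterate inside each tile */
                        for (int j = jt; j < jt + 32; ++j)
                            out[i][j] = 0.25f * (in[i-1][j] + in[i+1][j]
                                               + in[i][j-1] + in[i][j+1]);
                }
        }
    }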

Needless to say, the code in listing 10 is over-simplified. Real code generated by the compiler would have to account for loops with a non-constant number of iterations that may not even be exact multiples of the tile sizes; this is typically not something a normal user would want to write by hand. Also, the tile clause is applicable to more than two loops and may use any of the gang, worker, and vector levels of parallelization provided by OpenACC.

As we have seen, the OpenACC 2.0 specification provides several new features meant to improve the efficiency and portability of your codes. But with the recent release of OpenMP 4, which also brings accelerator support, we are often asked whether both standards will ultimately merge and, if not, which one will ultimately "win". This is a difficult question. Whereas OpenACC started as an attempt to speed up OpenMP's development process, the two specifications are now significantly different and can hardly be re-united. OpenACC has the advantage of being more mature, while OpenMP is clearly recognized and supported by more players in the HPC market. The OpenMP model for accelerators is also more flexible, but that is not necessarily an advantage, because flexibility usually comes at a significant cost in terms of raw performance; that is why the CUDA model, with all its constraints, was so successful over the last few years. Eventually, the success of both models will probably depend on the ability of accelerator vendors to find the right balance between performance and flexibility.

Happy programming!
