OK, but then what about the syntax differences, for instance?
GCC supports both the OpenMP and OpenACC syntaxes with the same underlying code. The point is that, if there are reasons for OpenACC to continue, they will rest on people's ability to compile really large codes, move them to the accelerator successfully, and get good performance. As I said, OpenMP has a more restrictive model that aims to run all of OpenMP on the device at once. The portability between devices that we have may allow OpenACC to continue. If we get to the point where all devices look the same, then yes, there is a reason to collapse to a single model.
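As a minimal illustration of the syntax difference (not code from the interview), here is the same vector addition written with each model's directives, using the OpenMP 4.0 target constructs:

```c
/* OpenACC version: one combined directive plus data clauses. */
void vadd_acc(int n, const float *a, const float *b, float *c)
{
    #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

/* OpenMP 4.0 target version: the offload, team and worksharing
   levels are spelled out explicitly in the directive. */
void vadd_omp(int n, const float *a, const float *b, float *c)
{
    #pragma omp target teams distribute parallel for \
            map(to: a[0:n], b[0:n]) map(from: c[0:n])
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
```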
While OpenACC aims at targeting accelerators from different vendors, we know that they all have their own hardware-specific features. In that sense, how can a generic programming model that abstracts the hardware still preserve high performance, or at least sufficient efficiency?
I would not expect OpenACC to give performance as good as CUDA's but, when written with directives, the same code can be used elsewhere. Right now, CUDA is not available on architectures other than NVIDIA’s. We need a model that approaches the performance of CUDA; it does not have to promise to be as good, just good enough. People from Cray, for instance, are talking about getting 89% of the performance of CUDA with OpenACC on Titan. So you can get there. You may spend a lot of time getting there, but less time than writing CUDA directly. To be honest, OpenACC is a relatively new standard, so the maturity of the compilers can also slow things down.
But to come back to the heart of your question about performance and hardware abstraction, OpenACC expresses three levels of parallelism inside the device. That seems to be enough to handle multicore CPUs. Codes running on AMD, Intel or ARM host targets are all approachable with that one model. For GPUs and for other accelerators, these three levels, with support for vectorization, are efficient enough to let you execute on those devices.
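As a rough sketch of those three levels (illustrative only; the mapping below is one possibility, and the compiler is free to remap it per target), a matrix-vector product spread across gangs, workers and vector lanes could look like this:

```c
/* Illustrative mapping of the three OpenACC levels of parallelism.
   On a GPU, gangs typically map to thread blocks and vector lanes
   to SIMT threads; on a multicore CPU, gangs map to cores and the
   vector level to SIMD instructions. */
void matvec(int n, const float *a, const float *x, float *y)
{
    /* gang/worker: coarse-grained parallelism over the rows */
    #pragma acc parallel loop gang worker \
            copyin(a[0:n*n], x[0:n]) copyout(y[0:n])
    for (int i = 0; i < n; ++i) {
        float sum = 0.0f;
        /* vector: fine-grained SIMD/SIMT lanes within a worker */
        #pragma acc loop vector reduction(+:sum)
        for (int j = 0; j < n; ++j)
            sum += a[i * n + j] * x[j];
        y[i] = sum;
    }
}
```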
So you can already run OpenACC on multicore CPUs with good performance, is that correct?
Multicore should be a target. There are more people right now who want to run OpenMP or MPI on the host and then run OpenACC on the accelerator. But we expect to see OpenACC used everywhere.
Let’s take, for instance, the “Nested OpenACC Regions” in OpenACC 2.0, which are much like the “CUDA Dynamic Parallelism” introduced in CUDA 5.0 on Kepler. What is the impact of using this kind of specific feature on accelerators other than NVIDIA Teslas?
There are two things here. First, you get the ability for a kernel to launch another kernel, which lets you express things like triangular loops, where the inner loop executes over smaller ranges than the outer loop does. This feature is useful whenever you have a device that actually has the ability to launch a kernel from a kernel. That covers all three accelerator architectures, AMD, Intel and NVIDIA, at this point. So this is as useful for AMD or Intel MIC as it is for NVIDIA.
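Purely as an illustration (not code from the interview), a triangular loop with a nested compute region might look like the following, assuming a compiler that supports OpenACC 2.0 nested parallelism; the function name and the scaling operation are hypothetical:

```c
/* Hypothetical sketch of the triangular-loop case: each outer
   iteration launches an inner compute region over only i+1 columns,
   so the nested device-side launches shrink as the outer loop
   advances. */
void scale_lower_triangle(int n, float *a)
{
    #pragma acc data copy(a[0:n*n])
    {
        #pragma acc parallel loop gang
        for (int i = 0; i < n; ++i) {
            /* Nested, device-side launch over a smaller range. */
            #pragma acc parallel loop
            for (int j = 0; j <= i; ++j)
                a[i * n + j] *= 2.0f;
        }
    }
}
```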
To be honest, I have not looked at the underpinnings of how it behaves on MIC, and CAPS or PGI can answer that question on the AMD side. This actually depends on how the compiler implements the nested parallelism. We care that the behavior is common across them, and we look for consistency of results. The compilers involved still treat this as a bit of a race: each one wants to do better than the others. Everyone has their own interests and brings their own strengths. For example, CAPS has the ability to take any host compiler and generate either OpenCL or CUDA code from OpenACC. That is an interesting approach. Samsung, which is not even a member of OpenACC, has also taken this approach of using OpenACC to produce OpenCL code for their targets. They used GCC to do that work, which they have made public. So there is a Fortran OpenACC 1.0 implementation that is already done, and Samsung is now collaborating with the GCC people to bring OpenACC 2.0 support into GCC.
OpenACC is meant to be portable, but performance is clearly not. As a practical example, a specific number of gangs or workers in an OpenACC code can be efficient on a given piece of hardware yet dramatically slow on another. What is the trade-off between portability and efficiency?
This is actually another good question to ask the compiler vendors. I know, for example, that there are defines that change the default number of gangs and workers from one device to another. But the developer should not have to program these specifics, otherwise they will start tuning for one architecture or another. The compilers actually have their own view on how this is best done and, in the end, you should leave it to the compiler. You can still tune for an architecture if you want to, but it is surprising how well the compiler vendors know what the right values are.
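For illustration, here is a hedged sketch of what such explicit tuning looks like; the clause values below are hypothetical and chosen only to show the mechanism, not recommended settings:

```c
/* Explicit tuning clauses overriding the compiler's defaults.
   The values 256 and 128 are hypothetical: they may be fast on one
   device and dramatically slow on another, which is exactly the
   portability trade-off discussed above. */
void saxpy_tuned(int n, float alpha, const float *x, float *y)
{
    #pragma acc parallel loop gang vector \
            num_gangs(256) vector_length(128) \
            copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = alpha * x[i] + y[i];
}
```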
In that regard, I would also like to mention the fat-binary effort going on, which consists of building a single binary containing code for multiple architectures. PGI supports this. So you create a program that can run on the host alone, on the host plus an NVIDIA GPU if present, or on an AMD GPU as well. This is a good approach when you start to do commercial software.
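As a usage sketch, PGI's unified-binary support of that era could be invoked with a single compile line along these lines; the exact target keywords vary by compiler release, so treat the specifics as an assumption:

```sh
# Hypothetical PGI compile line producing one executable with both a
# host code path and an NVIDIA GPU code path, selected at run time.
pgcc -acc -ta=nvidia,host -o myprog myprog.c
```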