OpenMP Device Constructs
September 04, 2015

Many-core devices, with their many small, power-efficient processing units, provide massive threading and SIMD processing. The value of this unprecedented hardware parallelism is widely acknowledged by industry, but adoption has been slow, partly due to the lack of an open standard. Nevertheless, device-specific and general APIs such as CUDA, OpenACC, and OpenCL give users ways to port their codes to devices.

OpenMP is a well-known open standard for shared-memory multiprocessing. Recently, the OpenMP language committee extended the standard to include support for heterogeneous, non-shared-memory computing. OpenMP extensions now provide the ability to run code on both the host and a device in a “work sharing” manner within a single program. The execution model starts on a host processor. Sections of code encapsulated by OpenMP target directives are launched for execution on a device, while optionally allowing the host to execute in parallel with the device. The host controls the allocation of device memory, the transfer of data, the queuing of target executions, and the management of their completion.

Significantly, OpenMP now provides a single, parallel model for threading, worksharing, device targeting, teams, and SIMD execution. A single paradigm provides a portable platform for development and a highly composable platform for integrating heterogeneous executions within a single program.
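
To make the “host executes in parallel with the device” idea concrete before diving into the directives, here is a minimal sketch (not one of the article's listings; the function and array names are ours): placing the target region in one branch of a parallel sections construct lets a second host thread keep computing while the device runs.

  void overlap(float *d, float *h, int N)
  {
    #pragma omp parallel sections
    {
      #pragma omp section
      {
        /* this host thread blocks until the device finishes the target region */
        #pragma omp target map(tofrom: d[0:N])
        for (int i = 0; i < N; i++)
          d[i] = d[i] * 2.0f;
      }
      #pragma omp section
      {
        /* meanwhile, a second host thread keeps working on host data */
        for (int i = 0; i < N; i++)
          h[i] = h[i] + 1.0f;
      }
    }
  }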

Programming Devices with OpenMP

1. Code Execution on a Target Device

In OpenMP there is a host device and a set of target devices. Program execution begins on the host device. A thread encountering a target directive on the host device will execute subsequent code statements in the target region on a target device. Variables accessed in the target region are mapped to the device and the target region is executed by the target device. By default, the host thread that encountered the target region waits for the target device to complete the execution of the target region.

  void vec_mult(int N)
  {
    int i;
    float p[N], v1[N], v2[N];

    init(v1, v2, N);

    #pragma omp target
    for (i=0; i<N; i++)
      p[i] = v1[i] * v2[i];

    output(p, N);
  }

In Listing 1, the for-loop following the target directive is executed on a target device. The host thread waits for the completion of the target region and then continues with the execution of the function output. The variables p, v1, v2, i, and N are mapped to the target device at the beginning of the region and mapped back from the device at the end of the region.

If a function is called from a target region, a declaration of that function must appear between a declare target / end declare target directive pair. This tells the compiler to generate a device version of the function along with the host version.

  #pragma omp declare target
  extern void fib(int N);
  #pragma omp end declare target

  #define THRESHOLD 1000000

  void fib_wrapper(int n)
  {
    #pragma omp target if(n > THRESHOLD)
    {
      fib(n);
    }
  }

In Listing 2, the function fib() is called from a target region, and the declaration of fib() appears between a declare target / end declare target directive pair.

2. The Device Data Environment

Each device has a device data environment containing the variables currently mapped to the device. In the mapping process, each variable referenced in a target construct is allocated a corresponding variable in the device data environment. By default, the corresponding variable is initialized with the value of the original variable on entry to the target region, and the original variable is assigned the value of the corresponding variable on exit from the region. (The original variable is the variable that the host sees, outside of any target device data environment.)
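
A minimal sketch of this default behavior, assuming the OpenMP 4.0 mapping rules described above (the function and variable names here are ours, not from the article's listings): a scalar updated inside a target region is assigned back to the original host variable when the region ends.

  #include <stdio.h>

  void scale_sum(int N)
  {
    int i;
    float a[N], sum = 0.0f;

    for (i = 0; i < N; i++)
      a[i] = (float)i;

    /* a, sum, i, and N are referenced in the target region, so corresponding
       variables are created on the device, initialized from the host values
       on entry, and assigned back to the host variables on exit */
    #pragma omp target
    for (i = 0; i < N; i++)
      sum += a[i];

    printf("sum = %f\n", sum);   /* sees the value computed on the device */
  }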

The mapped variable model supports both shared and distributed memory systems between host and target devices. Depending on the underlying hardware memory system, a mapped variable might require copies between host and target device memories, or no copies at all if the host and target device share memory. Even if memory is shared, a pointer translation or a memory coherence operation might still be required when mapping a variable. When an original variable in a host data environment is mapped to a corresponding variable in a device data environment, the mapped variable model asserts that the original and corresponding variables may share storage. Writes to the corresponding variable may alter the value of the original variable. Therefore, a program cannot assume that mapping a variable results in a copy of that variable.

2.1 Mapping Variables to a Device
The map clause specifies how original variables are mapped to their corresponding variables in a device data environment. The map clause has a map-type that can be used to optimize the mapping of variables. The to map-type indicates that on entry to the region the corresponding variable is initialized with the value of the original variable. The from map-type indicates that on exit from the region the original variable is assigned the value of the corresponding variable. The tofrom map-type is the default and combines the behaviors of the to and from map-types. The alloc map-type neither initializes the corresponding variable on entry to the region nor assigns the original variable on exit from the region.
  extern void init(float*, float*, int);
  extern void output(float*, int);

  void vec_mult(int N)
  {
    int i;
    float p[N], v1[N], v2[N];

    init(v1, v2, N);

    #pragma omp target map(to: v1, v2) map(from: p)
    #pragma omp parallel for
    for (i=0; i<N; i++)
      p[i] = v1[i] * v2[i];

    output(p, N);
  }

In the map clauses of Listing 3, the to map-type indicates that on entry to the target region the corresponding variables v1 and v2 in the device data environment are initialized with the values of the original variables in the host’s data environment. On exit from the target region, their values are not copied back and the device storage is released. The from map-type for the variable p specifies that device storage is created without initializing it from the original variable, and on exit the corresponding (device) values are copied back to the original variable before the device storage is freed.
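
Listing 3 does not use the alloc map-type. As a rough sketch (with hypothetical function and variable names), alloc is useful for scratch storage that is only ever touched on the device, so its contents never need to travel in either direction:

  /* scratch is a host buffer used only as device-side temporary storage:
     map(alloc:) reserves corresponding device storage without copying
     values in on entry or out on exit */
  void add_square(float *p, float *scratch, int N)
  {
    #pragma omp target map(tofrom: p[0:N]) map(alloc: scratch[0:N])
    {
      for (int i = 0; i < N; i++)
        scratch[i] = p[i] * p[i];
      for (int i = 0; i < N; i++)
        p[i] += scratch[i];
    }
  }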
