Processor evolution: what to prepare application codes for?

By François Bodin, Henri Calandra and Alain Refloc'h | April 28, 2014

GLOSSARY

Algorithm libraries: A set of routines providing frequently used computations (e.g. BLAS – Basic Linear Algebra Subroutines). Library implementations are usually highly optimized and specific to a given system.

Accelerator: A piece of specialized hardware which performs some tasks efficiently. Accelerators can be programmable but may not implement a general purpose instruction set architecture (ISA).

Arithmetic intensity: The number of operations performed per word of memory transferred. A high ratio means a good use of the numerous computing units of many-core processors and a minimal demand on the memory systems.
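
As a rough illustration of this ratio, the sketch below compares two textbook kernels. The operation and word counts are idealized assumptions (each operand crosses the memory interface exactly once), and the function name `arithmetic_intensity` is introduced here only for illustration.

```python
def arithmetic_intensity(operations, words_transferred):
    """Operations performed per word of memory transferred."""
    return operations / words_transferred


# Vector update y[i] = a * x[i] + y[i] over n elements:
# 2 operations per element; x[i] and y[i] are read, y[i] is written (3 words).
n = 1_000_000
axpy = arithmetic_intensity(2 * n, 3 * n)

# Dense product C = A * B of m x m matrices, idealized so that each
# matrix is transferred exactly once: 2 * m**3 operations, 3 * m**2 words.
m = 1000
matmul = arithmetic_intensity(2 * m**3, 3 * m**2)

print(f"vector update:  {axpy:.2f} operations per word")
print(f"matrix product: {matmul:.2f} operations per word")
```

Under these assumptions the vector update stays below one operation per word whatever the problem size, whereas the ratio for the matrix product grows with m.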

AVX: Intel AVX is a SIMD instruction set extension to SSE designed for Floating Point (FP) intensive computations.

Big Data: A large collection of data sets that is so complex that it is difficult to work on using traditional data processing platforms. Big data is frequently associated with the three Vs: Variety (range of data types and sources), Volume (amount of data), and Velocity (speed of data in and out).

Cache Memory: A cache is a smaller and faster memory space that stores the data being reused (temporal locality) close to the processing unit. Caches exploit spatial locality by loading blocks of contiguous data (cache line) and by prefetching potentially used data.

Checkpoint Restart: Runtime technique that saves the memory state of a program so that it can be restarted from a given execution step (typically after a fault or a timeout).
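
A minimal sketch of the idea, assuming the whole program state fits in a small dictionary saved with Python's `pickle` module. The `run` function, the `fail_at` parameter (used only to simulate a fault) and the file name are hypothetical; production codes generally rely on dedicated checkpointing libraries and parallel file systems.

```python
import os
import pickle
import tempfile


def run(total_steps, checkpoint_path, fail_at=None):
    """Sum 0..total_steps-1, saving the state after every step.

    fail_at is a hypothetical knob used only to simulate a fault.
    """
    # Restart: reload the saved state if a checkpoint exists.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"step": 0, "total": 0}

    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated fault")
        state["total"] += state["step"]
        state["step"] += 1
        # Checkpoint: write to a temporary file, then rename it, so that a
        # fault during the write cannot corrupt the last good checkpoint.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp_path, checkpoint_path)
    return state["total"]


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "state.ckpt")
    try:
        run(10, path, fail_at=6)   # stops with a simulated fault at step 6
    except RuntimeError:
        pass
    result = run(10, path)         # resumes from the checkpoint at step 6

print(result)
```

The second call does not recompute steps 0 to 5; it only executes the steps that follow the last saved state.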

Clock Frequency: The rate in cycles per second (measured in gigahertz) of the clock of a processor core.

Code Architecture: Global organization of the data structures and computation of a code.

Compute Node: A set of processing units connected to a main memory.

Core: An integrated circuit that implements an instruction set architecture (ISA).

Data Affinity: The concept of allocating workloads according to data localization in a memory hierarchy. Data affinity aims at minimizing data movements within the system.

Data locality: Refers to accesses that reuse the same data or touch neighbouring data. Data locality is usually divided into temporal locality (where a memory location is accessed multiple times within a relatively short period of time) and spatial locality (where data is stored contiguously so that memory is accessed in sequence).

Data Parallelism: A simple form of parallelism that consists of performing a given computation in parallel on a set of data. The computation is usually performed on the processing unit where the data resides (owner-computes rule).
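
The sketch below illustrates the structure of a data-parallel computation: the data is partitioned into blocks, and each worker applies the same computation to the block it owns. The helper names `partition` and `scale_block` are hypothetical, and Python threads are used only to show the pattern; they are not expected to deliver a speedup for this kind of pure-Python arithmetic.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, parts):
    """Split data into `parts` contiguous blocks of near-equal size."""
    size, extra = divmod(len(data), parts)
    blocks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        blocks.append(data[start:end])
        start = end
    return blocks


def scale_block(block):
    """The same computation, applied by each worker to the block it owns."""
    return [2.0 * x for x in block]


data = [float(i) for i in range(10)]
blocks = partition(data, 4)

with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map returns the results in the order of the input blocks.
    scaled_blocks = list(pool.map(scale_block, blocks))

result = [x for block in scaled_blocks for x in block]
print(result)
```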

Dependencies (Data): The sequential execution of a program defines an order in which operations/instructions are executed. Parallel implementations of sequential codes must respect this (partial) order between reads and writes to preserve the original semantics. As a consequence, the parallelization of a serial code starts by identifying the “data dependencies” in the code.
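
As a small illustration, the two loops below differ only in whether an iteration reads a value written by another iteration. Running each loop in a different iteration order is used here as a stand-in for an arbitrary parallel schedule; the function names are hypothetical.

```python
def prefix_sum(values, order):
    """a[i] = a[i] + a[i-1]: iteration i reads what iteration i-1 wrote."""
    a = list(values)
    for i in order:
        a[i] = a[i] + a[i - 1]
    return a


def scale(values, order):
    """a[i] = 2 * a[i]: no iteration reads data written by another one."""
    a = list(values)
    for i in order:
        a[i] = 2 * a[i]
    return a


values = [1, 2, 3, 4]
forward = range(1, len(values))           # i = 1, 2, 3
backward = range(len(values) - 1, 0, -1)  # i = 3, 2, 1

# Loop with a data dependency: changing the order changes the result.
dependent_forward = prefix_sum(values, forward)
dependent_backward = prefix_sum(values, backward)

# Loop without dependencies: any order gives the same result.
independent_forward = scale(values, forward)
independent_backward = scale(values, backward)

print(dependent_forward, dependent_backward)
print(independent_forward, independent_backward)
```

Only the second loop can be distributed over parallel resources without further analysis or transformation.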

Directives: Extra information added as comments into a program to be used by the compilation process.

Efficiency: For P parallel resources, the efficiency is T1 / (P ∗ Tp) where T1 is the execution time using one serial process and Tp is the execution time using P parallel resources.
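
The formula can be applied directly to measured timings, as in the sketch below. The timings are made-up values chosen for illustration, not measurements, and the `speedup` helper simply computes the ratio T1 / Tp.

```python
def speedup(t1, tp):
    """Ratio between the serial time T1 and the parallel time Tp."""
    return t1 / tp


def efficiency(t1, tp, p):
    """T1 / (P * Tp) for P parallel resources."""
    return t1 / (p * tp)


# Hypothetical timings in seconds, keyed by the number of resources P.
t1 = 120.0
timings = {2: 63.0, 4: 34.0, 8: 20.0}

for p, tp in timings.items():
    print(f"P={p}: speedup {speedup(t1, tp):.2f}, "
          f"efficiency {efficiency(t1, tp, p):.2f}")
```

With these illustrative numbers the efficiency decreases as P grows, which is the typical behaviour when the amount of work is kept constant.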

Exascale: Computation scale capable of reaching 10¹⁸ floating-point operations per second (FLOPS) and handling 10¹⁸ bytes of storage.

GPGPU: General Purpose Graphics Processing Unit, i.e. GPU for scientific computing.

Heterogeneous Architecture: Compute node or processor architecture having processing units with different characteristics such as speed and instruction sets.

Hybrid Programming: Programming approach in which multiple parallel APIs are used in the same program to address various aspects of the target architecture. The most common form of hybrid programming combines MPI with OpenMP. When dealing with GPUs, this comes with the use of GPU programming APIs such as CUDA, OpenCL or HMPP in OpenMP or MPI codes.

Irregular Reduction: Computation of a set of scalar reductions that are typically stored in an array. Also known as “histogram reduction”.
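
A minimal sketch of such a reduction is given below: the array element updated at each iteration is selected through an index array, so two iterations may update the same element. One common way to parallelize it, assumed here, is to give each worker a private copy of the result array and to merge the copies at the end. The workers are simulated sequentially, and the function names are hypothetical.

```python
def histogram_serial(bins, indices, weights):
    """hist[indices[k]] += weights[k] for every k."""
    hist = [0] * bins
    for i, w in zip(indices, weights):
        hist[i] += w
    return hist


def histogram_privatized(bins, indices, weights, workers):
    """Same reduction with one private array per (simulated) worker."""
    partial = []
    for wk in range(workers):
        private = [0] * bins
        # Worker wk handles iterations wk, wk + workers, wk + 2*workers, ...
        for k in range(wk, len(indices), workers):
            private[indices[k]] += weights[k]
        partial.append(private)
    # Merge step: element-wise sum of the private arrays.
    return [sum(column) for column in zip(*partial)]


indices = [0, 2, 1, 2, 0, 2, 3, 1]
weights = [1, 2, 3, 4, 5, 6, 7, 8]

serial = histogram_serial(4, indices, weights)
privatized = histogram_privatized(4, indices, weights, workers=3)
print(serial, privatized)
```

Integer weights are used so that both versions give exactly the same result regardless of the order of the additions.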

Many-Core: A processor having a large number of processing units.

Memory Bandwidth: The rate (usually expressed in gigabytes per second) at which data can be read from or stored to memory. This is usually the main performance bottleneck.

Memory Hierarchy: The organization of a system comprised of a main memory and cache memories.

Message-Passing: Communication scheme between parallel tasks based on messages. The most widely used method for scientific computing is MPI (Message Passing Interface).
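
The sketch below mimics this communication scheme with Python threads and queues. It is only a stand-in for illustration and does not use the MPI API: each task owns an inbox, and tasks exchange data exclusively by sending and receiving messages. All names (`worker`, `inboxes`, `results`) are hypothetical.

```python
import queue
import threading


def worker(rank, inbox, outbox):
    """Receive one chunk, reduce it locally, send the partial result back."""
    chunk = inbox.get()               # blocking receive
    outbox.put((rank, sum(chunk)))    # send (rank, partial sum)


data = list(range(100))
num_workers = 4
chunk_size = len(data) // num_workers

inboxes = [queue.Queue() for _ in range(num_workers)]
results = queue.Queue()
threads = [
    threading.Thread(target=worker, args=(rank, inboxes[rank], results))
    for rank in range(num_workers)
]
for t in threads:
    t.start()

# The coordinating task sends each worker the chunk it has to process.
for rank in range(num_workers):
    inboxes[rank].put(data[rank * chunk_size:(rank + 1) * chunk_size])

# It then receives one message per worker, in arrival order.
partials = dict(results.get() for _ in range(num_workers))
for t in threads:
    t.join()

total = sum(partials[rank] for rank in range(num_workers))
print(total)
```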

Multi-core Processor: A processor having more than one computing core. Such a processor is also referred to as Chip Multi-Processor (CMP).

Multi-Threading: A technique for executing multiple instruction threads efficiently.

MPI: Message Passing Interface. MPI is a widely available open standard that specifies a library of message passing functions between computing units. MPI is designed for massively parallel architectures.

Non-Uniform Memory Access: Term used when the speed of a memory access depends on the location/address of data.

NVM: Non-Volatile Memory.

OpenMP: A directive-based API for thread-based parallel computing. OpenMP assumes that memory is shared between threads (within a shared address space).

Power Wall: Power demand that prevents CPU clock frequency from being increased.

Reductions: Global commutative and associative reductions (such as additions and multiplications) that can be implemented in an efficient parallel scheme. Parallel implementations of reductions reorder the operations and therefore do not respect the data dependencies of the serial code; with floating-point arithmetic, this can modify the output of the code.
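
The effect on the output can be reproduced with a few double-precision values, as in the sketch below. Floating-point addition is not associative, so summing partial results per chunk (as a parallel reduction would) may round differently from a left-to-right serial sum. The input values are chosen on purpose to expose the effect, and the function names are hypothetical.

```python
def reduce_serial(values):
    """Left-to-right sum, as executed by the serial code."""
    total = 0.0
    for v in values:
        total += v
    return total


def reduce_chunked(values, chunks):
    """Sum each chunk separately, then sum the partial results."""
    size = -(-len(values) // chunks)  # ceiling division
    partials = [
        reduce_serial(values[start:start + size])
        for start in range(0, len(values), size)
    ]
    return reduce_serial(partials)


values = [1e16, 1.0, -1e16, 1.0]

serial = reduce_serial(values)       # ((1e16 + 1.0) - 1e16) + 1.0
chunked = reduce_chunked(values, 2)  # (1e16 + 1.0) + (-1e16 + 1.0)

print(serial, chunked)  # 1.0 0.0
```

Neither result is the exact sum (2.0); the point is only that the two evaluation orders disagree.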

Runtime Libraries: Set of routines in charge of managing computer resources during the execution of a program. Runtime libraries rely on operating systems and/or device drivers.

SIMD: Single Instruction Multiple Data. This technique means performing an operation on multiple data in parallel. Vector computers as well as microprocessors with Intel-SSE or IBM-Altivec instruction extensions exploit SIMD techniques.

SIMT: Single Instruction Multiple Threads [33].

Speedup: Performance improvement in parallel systems is stated in terms of speedup, which is the ratio between the serial execution time and the optimized/parallelized execution time.

Scalability: Property of a parallel code / algorithm. A code is said to be scalable if its efficiency ratio remains close to 1 when the number of parallel computing resources (e.g. cores, processes) increases. Scalability is rarely achieved when increasing the number of parallel processes without increasing the amount of work to perform.

Serialization: Data structure serialization consists of converting data elements into a byte stream, thus allowing them to be written to a storage medium.
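
A minimal sketch using Python's standard `struct` module is shown below. The record layout (a little-endian 32-bit count followed by that many 64-bit floating-point values) is an assumption made for this example, not a standard format.

```python
import struct


def serialize(values):
    """Pack a list of floats as: int32 count, then `count` float64 values."""
    return struct.pack(f"<i{len(values)}d", len(values), *values)


def deserialize(stream):
    """Rebuild the list of floats from the byte stream."""
    (count,) = struct.unpack_from("<i", stream, 0)
    # With "<" there is no alignment padding: the floats start at byte 4.
    return list(struct.unpack_from(f"<{count}d", stream, 4))


values = [1.5, -2.25, 3.0]
stream = serialize(values)      # bytes object, ready to be written to a file
restored = deserialize(stream)

print(len(stream), restored)
```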

Vectorization: The process of translating a set of serial program statements into vector instructions.
