HPC Today | High Performance Computing

High Performance Computing – The Last Rewrite

By The Editorial Team | April 20, 2016

John Levesque is the director of the Cray Supercomputing Center of Excellence at Cray Inc.

For the past 20 years, high performance computing has benefited from a significant reduction in the clock cycle time of the basic processor. Going forward, trends indicate that the clock rate of the most powerful processors in the world may stay the same or decrease slightly. At the same time, the amount of physical space that a computing core occupies is still trending downward. This means more processing cores can be contained within a chip.

Hybrid Core
With this paradigm shift in chip technology, caused by the amount of electrical power required to run the device, additional performance is being delivered by increasing the number of processors on the chip and reintroducing SIMD (vector) processing. The goal is to deliver more floating-point operations per watt. Interestingly, these evolving chip technologies are being used on scientific systems as small as a single workstation and as large as the systems on the Top500 list.

These new chips do not always use a new architecture; they often expand on the architecture we have seen over the past 10 years. The basic building blocks of high performance computers are nodes that contain shared memory multiprocessors. The number of multiprocessors on the node will increase from two to hundreds, and the SIMD length will increase from two to 10 or more. In total, node performance is increasing to between 20 and 100 times what the initial multicore nodes could deliver.

Changing the memory model
Another significant change in node architecture is a movement away from a flat memory model to a deep memory hierarchy. This new memory architecture is required because the bandwidth demanded by the new node architecture cannot be supplied by current memory architectures; memory technology has not progressed as quickly as processor technology. An affordable way to deliver such high bandwidth is to use a memory hierarchy with several levels of cache, a small, fast near memory, and a larger, slower distant memory.
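One common way software takes advantage of such a hierarchy is loop blocking (tiling): finish the work on a tile small enough to stay in the fast levels before moving on. The C sketch below is an illustration only and not something this article prescribes. The function name `transpose_blocked` and the `block` parameter are hypothetical, and the best tile size depends on the machine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: transpose an n-by-n row-major matrix tile by tile,
   so that the lines touched in `in` and `out` are reused while they are
   still resident in the faster levels of the memory hierarchy. */
void transpose_blocked(size_t n, size_t block,
                       const double *restrict in, double *restrict out)
{
    assert(block > 0);

    for (size_t ii = 0; ii < n; ii += block) {
        for (size_t jj = 0; jj < n; jj += block) {
            /* Clamp the tile at the matrix edge when block does not divide n. */
            size_t i_end = (ii + block < n) ? ii + block : n;
            size_t j_end = (jj + block < n) ? jj + block : n;

            for (size_t i = ii; i < i_end; ++i) {
                for (size_t j = jj; j < j_end; ++j) {
                    out[j * n + i] = in[i * n + j];
                }
            }
        }
    }
}
```

The result is the same as an unblocked transpose; only the order in which memory is visited changes.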

In the past, many application developers were satisfied using MPI across all of the processors within a node as well as across nodes, and they were not concerned about the vectorization of the lower-level loops. This approach will no longer ensure good performance. The application developer must either completely rewrite the application in a more parallel form, perhaps using Chapel (reference Brad’s Blog), or refactor the existing application into a form that can use shared memory threads on the node. The refactored code must also ensure that the inner loops take advantage of SIMD (vector) instructions and that operands are close to the functional units, that is, that they have been stored or pre-fetched into the faster memory. An example of refactoring an all-MPI application to a hybrid form is the porting and optimization of S3D from Jaguar, the fastest system in the world in 2009, to Titan, the fastest system in the world in 2012. The time to solution was reduced by a factor of seven overall.
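As a rough sketch of the on-node part of such a refactoring, the C fragment below threads an outer loop with OpenMP and marks the inner, unit-stride loop for SIMD. The kernel, the name `axpy_block`, and the choice of OpenMP are assumptions made for illustration. This is not S3D code, and in a hybrid application each MPI rank would call something like this on its own block of data.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel: y = a * x + y over a rows-by-cols block (row-major),
   standing in for the data one MPI rank would own in a hybrid code. */
void axpy_block(size_t rows, size_t cols, double a,
                const double *restrict x, double *restrict y)
{
    assert(x != NULL && y != NULL);

    /* Outer loop: shared memory threads divide the rows among the cores. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < rows; ++i) {
        const double *xi = x + i * cols;
        double *yi = y + i * cols;

        /* Inner loop: unit stride and no loop-carried dependence,
           so the compiler can map it onto SIMD (vector) instructions. */
        #pragma omp simd
        for (size_t j = 0; j < cols; ++j) {
            yi[j] = a * xi[j] + yi[j];
        }
    }
}
```

Built without OpenMP support, the pragmas are ignored and the loops run serially with the same result, which makes it possible to introduce threading and vectorization incrementally.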

While shared memory parallelization and vectorization are not new concepts, today’s application developers are often unaware of how best to employ shared memory parallelization or how to rewrite looping structures to take advantage of the low-level SIMD (vector) instructions. They also seldom consider where their operands are stored or how best to organize their data structures.
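To make the point about data structure organization concrete, the C sketch below contrasts two layouts for the same data. It is a generic illustration rather than an example from any particular application; the particle fields and all type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Array of structures (AoS): the fields of one particle sit together, so a
   loop that needs only x still strides over y, z and mass in memory. */
struct particle_aos { double x, y, z, mass; };

double sum_x_aos(const struct particle_aos *p, size_t n)
{
    assert(p != NULL);
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) {
        s += p[i].x;              /* stride of four doubles */
    }
    return s;
}

/* Structure of arrays (SoA): each field is contiguous, so the same loop
   reads memory with unit stride, which suits caches and SIMD loads. */
struct particles_soa { double *x, *y, *z, *mass; };

double sum_x_soa(const struct particles_soa *p, size_t n)
{
    assert(p != NULL && p->x != NULL);
    double s = 0.0;
    #pragma omp simd reduction(+:s)
    for (size_t i = 0; i < n; ++i) {
        s += p->x[i];             /* unit stride */
    }
    return s;
}
```

Both functions compute the same sum; which layout is preferable depends on how the fields are accessed together in the application's hot loops.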