To pack or not to pack?

By Christian Simmendinger | July 28, 2015

Dr. Christian Simmendinger, senior HPC consultant for T-Systems – Solutions for Research

ISC 2015 is late this year. At this time of the year, families are usually already packing for their summer holidays. Although this blog post is about Supercomputing rather than holidays, there is a direct relation between the two topics: packing. When my kids were younger, packing for the holidays was always a problem. While we managed to bring along the essentials (cuddly toy and toothbrush), we inevitably forgot items the kids considered indispensable (like cuddly toys number 2, 3, 4, 5 and 6). Assembling all the required stuff took ages. Things were never where they should have been; rather, they were spread across the entire flat. Over the years, packing time has steadily improved. The kids now not only know where the stuff is (still spread across the flat), they also pack it themselves and then bundle it into their trunk in our old campervan.

How does all that translate to Supercomputing?
The answer is that in programming models – and especially in hybrid programming models – we frequently have a similar situation. Packing data for communication can take a long time if it is done by a single thread. If instead every thread assembles the data it owns (cuddly toys and toothbrush) and copies it into an assigned slot in the linear communication buffer, you are ready for departure much earlier.
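One way to picture the "assigned slot" is as a start offset per thread in the linear buffer. The following is only a minimal sketch under that assumption, not the code used in the project: the helper name `slot_offsets` and the idea of passing per-thread element counts are made up for illustration. It computes an exclusive prefix sum, so that each thread knows where its own contribution begins.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original post): given how many elements
// each thread contributes to one message, return the start offset of each
// thread's slot in the linear communication buffer.
// The result has counts.size() + 1 entries; the last entry is the total
// buffer length.
std::vector<std::size_t> slot_offsets(const std::vector<std::size_t>& counts) {
    std::vector<std::size_t> offsets(counts.size() + 1, 0);
    for (std::size_t t = 0; t < counts.size(); ++t) {
        // Exclusive prefix sum: slot t starts where slot t-1 ends.
        offsets[t + 1] = offsets[t] + counts[t];
    }
    return offsets;
}
```

For a mesh that does not change, such offsets would only need to be computed once, after which every thread can write into its slot without coordinating with the others.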

As an example, let’s assume we run an HPC application on an Intel Xeon Phi processor with 240 threads. Let’s further assume that we have 3 communication neighbors (3 other Phi processors). For a small problem size we might then have up to 80 threads (240 threads spread over 3 message buffers) participating in the assembly of a single linear message buffer.

Similar to going on holidays with kids, the fastest departure here is achieved neither with the kids all traveling by themselves (splitting the communication buffer into 80 different messages) nor with a single thread (you) packing all the stuff. Instead, all threads ideally pack their own stuff into that linear buffer. As soon as the last contribution has arrived, you are ready to go.
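The scheme can be sketched in a few lines of standard C++. This is an illustration under stated assumptions rather than the implementation from the project: the names `pack_parallel`, `owned` and `offsets` are invented here, plain `std::thread` stands in for whatever threading model the application uses, and the actual send is left out. Each thread copies only its own data into its own slot, and an atomic counter identifies the thread whose contribution arrives last.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Sketch only; all names are illustrative assumptions.
//   owned[t]   : the values thread t contributes to this message
//   offsets[t] : start of thread t's slot in the linear buffer
//   buffer     : pre-sized linear communication buffer
// Returns the id of the thread whose contribution arrived last. In a real
// code, that thread would trigger the send of the completed buffer.
std::size_t pack_parallel(const std::vector<std::vector<double>>& owned,
                          const std::vector<std::size_t>& offsets,
                          std::vector<double>& buffer) {
    const std::size_t nthreads = owned.size();
    std::atomic<std::size_t> arrived{0};
    std::atomic<std::size_t> last{nthreads};  // sentinel until someone is last
    std::vector<std::thread> workers;
    workers.reserve(nthreads);
    for (std::size_t t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
            // Pack: copy this thread's data into its assigned slot.
            for (std::size_t i = 0; i < owned[t].size(); ++i) {
                buffer[offsets[t] + i] = owned[t][i];
            }
            // Announce arrival. The thread that raises the counter to
            // nthreads knows that all other slots are already filled.
            if (arrived.fetch_add(1, std::memory_order_acq_rel) + 1 == nthreads) {
                last.store(t);
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return last.load();
}
```

The point of the counter is that nobody has to wait at a barrier: whichever thread happens to finish last sees the buffer complete and can hand it over to the communication layer right away.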

There are some consequences to this simple scheme. One of them concerns MPI data types. MPI data types today fuse the packing and the send into a single operation. In a highly multithreaded environment that concept becomes questionable. This is especially true if access to (cache-coherently shared) memory becomes expensive and the last contributing thread needs to perform a single-threaded assembly (finding the toothbrush plus cuddly toys number 1-6 for 80 kids) of the linear communication buffer.

In the context of the European EXA2CT FP7 project we have implemented the various approaches to packing for the use case of a ghost cell exchange for unstructured meshes on Xeon Phi.

The difference in elapsed runtime between optimal multi-threaded packing and single-threaded packing becomes substantial very quickly. For a small mesh on just 4 Xeon Phi cards, we observe that optimal multi-threaded packing already improves parallel efficiency by a factor of up to two in terms of total application runtime.

If you want to learn more about optimal multithreaded packing, the EXA2CT FP7 project, or other European Exascale projects, come and meet us at the joint European Exascale Projects booth #634. We are looking forward to lively discussions with you.

About the author
Dr. Christian Simmendinger is a senior HPC consultant for T-Systems – Solutions for Research. He currently works for the European EXA2CT FP7 project and was also the project leader of the (BMBF-funded) GASPI project. Both projects aim at establishing a novel PGAS API (GASPI) for next-generation Exascale supercomputers. In 2013, together with Rui Machado and Carsten Lojewski, he was awarded the Joseph-von-Fraunhofer Preis (Award) for his contributions to GASPI/GPI.