Interview with Adrian Tate, Cray EMEA Research Lab

By Nages Sieslack | May 31, 2016

The convergence of high performance computing (HPC) with big data has become a critical focus of many commercial and research organizations that increasingly must accommodate both types of applications. As a result, vendors are grappling with the best way to design and provide such systems to their customers. One of the questions that arises is whether to use distinct systems optimized for either HPC or big data, or a single system flexible enough to serve both application sets.

We recently spoke with Adrian Tate of Cray to get an idea of how the iconic supercomputer maker is approaching this convergence dynamic. Tate, who is the director of the Cray EMEA Research Lab (CERL), is preparing to speak about this topic at the upcoming ISC High Performance conference in Frankfurt, Germany. We asked him to elaborate on the factors driving HPC/big data usage and to identify the critical technologies that will be needed to satisfy these customers.

1. What is the relationship between HPC and Big Data? From a technology standpoint, which one is driving the other?

HPC and data analytics evolved quite independently, driven by completely different factors and largely unconcerned with each other. The reason we speak about their intersection is that these largely independent fields now both face significant new challenges, and each has something to offer the other in terms of solutions. HPC is about to be inundated with vast amounts of data from new or next-generation instruments, and to ignore the lessons already learned about handling and mining massive data sets would be folly. Likewise, the complexity and diversity of certain data analytics problems drive a need for higher-performance data analytics, where the lessons learned by the HPC community are highly relevant. At Cray we can contribute a lot to this discussion. Although we are known mainly for supercomputing, we have also been selling analytics systems for many years, have deep expertise and strong customers in both areas, and have been exploring the intersection for some time through co-design projects.

2. Workloads that are compute-intensive, by definition, have rather different hardware requirements than those that are data-intensive. Providing a system that can do both efficiently is bound to come with some trade-offs. Is it reasonable to try to build such converged systems or does it make more sense to keep the systems separate and provide a software framework to manage the different workloads on different hardware?

Cray recently formed its first Research Lab. This was a response to the need to answer questions like the one you ask, in order to understand how to use future systems efficiently. We don’t perform pure research – we investigate which technologies and which configurations are really going to be useful to our customers in the near future. So I would start by saying that the question of how to configure systems for mixed workloads is an open research problem that we are looking at in the context of co-design projects, along with dozens of other related questions in this area. That said, we do believe strongly that future systems are starting to look extremely interesting from a mixed-workload perspective. Cray’s Shasta system will combine the flexibility of clusters with the reliability and environment of high-end supercomputers. This will be the first supercomputer to allow you to truly mix and match node types – different types of CPU and accelerators – and to configure the memory hierarchy to support your specialist workload. In this setup, you can envisage machine partitions that are well suited to your simulation, analytics, machine-learning or visualisation needs, all supported by a homogeneous, high-performance environment and toolset. Many mixed-workload models will be well suited to this partition model, without the trade-offs you mention coming into play. This doesn’t mean that we are ignoring in-situ analytics, of course – some customers want to interleave simulation and analytics on the same compute nodes. Here the shared infrastructure needs to support two differing workloads, and trade-offs are inevitable. Flexible technology, especially in the memory hierarchy, remains the key, but by definition a single system is the only option here.

By the way, we don’t have to wait for years to see this in action. Cray will soon release a system codenamed Athena that is optimized for data analytics but includes an environment and network to support HPC, so you can see elements of this convergence already taking place.

3. What kinds of system software, development tools, and middleware will be important in supporting compute- and data-intensive applications in these mixed environments?

Probably too much to mention here, so I’ll give three examples of software challenges we are investigating, covering near-term, medium-term and longer-term issues. In the near term, the software stacks of the two fields need, at a minimum, to co-exist. Each area has solved certain challenges from its own perspective – for example, the Spark and Hadoop communities have approached workload management with resilience and portability being the key constraints, while in HPC this problem has been solved from the flexibility and simplicity perspectives. At some level these tools will have to play together nicely, and in the longer term they should be unified into a single software layer to avoid clutter.

In the medium term, there are data movement problems that are common to non-HPC workloads such as visualization and analytics and that require significant R&D. For example, migrating a very large data set seamlessly from one distributed partition of a system to another (also distributed) partition requires significant software support that does not yet exist. We are working on such an environment as part of the Human Brain Project PCP and expect this to be generally useful in the future. We also see a need for improved support for data streaming, data management and long-term storage, and for significant improvements to external networks so that huge data sets can be moved into and out of the system. Data will have to be reduced on the fly and/or processed near storage (ideas that are being explored in SAGE).

Looking further out, once mixed workloads can coexist to some extent, it will be important to unify their programming models. This is a huge challenge because the programming of analytics and of HPC has also been approached completely differently. Analytics has the luxury of abstracting the hardware from the user entirely and presenting instead an interface optimised for information retrieval. HPC has been largely focused on programming models that do the opposite – represent to some degree the underlying hardware. The main HPC programming construct – a C/Fortran array – actually represents reasonably closely the data layout in DRAM, which was great when DRAM was the only relevant memory space but is increasingly outdated now. Recently, programming abstractions have been introduced that separate the data layout from the data structures presented to the user (e.g. Kokkos), and this would seem to be a must-have abstraction in future languages. But even with that separation, we still have to understand how to decompose data problems onto a complex memory hierarchy. We are looking at how to express basic data decompositions on multi-level memory hierarchies as a mathematical optimization problem so that the runtime system can help solve this problem. In data analytics, the abstraction is already there, but the underlying runtime support for efficient memory hierarchy usage will be just as important.
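
To make the idea of separating data layout from the data structure concrete, here is a minimal sketch in Python. It is an illustration only: it is not Kokkos code and not anything Cray has described, and the names `View2D`, `row_major` and `col_major` are hypothetical. The point it shows is that the same user code, written against logical (i, j) indices, can run unchanged while the physical ordering of the elements in memory is swapped underneath it.

```python
# Illustrative sketch only: View2D, row_major and col_major are hypothetical
# names, not the API of Kokkos or any other library.

def row_major(i, j, rows, cols):
    """Map logical (i, j) to a flat offset, C-style (last index contiguous)."""
    return i * cols + j


def col_major(i, j, rows, cols):
    """Map logical (i, j) to a flat offset, Fortran-style (first index contiguous)."""
    return j * rows + i


class View2D:
    """A 2D array whose physical layout is a swappable policy."""

    def __init__(self, rows, cols, layout=row_major):
        self.rows = rows
        self.cols = cols
        self.layout = layout
        self.data = [0] * (rows * cols)  # flat backing store

    def __getitem__(self, idx):
        i, j = idx
        return self.data[self.layout(i, j, self.rows, self.cols)]

    def __setitem__(self, idx, value):
        i, j = idx
        self.data[self.layout(i, j, self.rows, self.cols)] = value


def fill(view):
    """User code: written once against logical indices only."""
    for i in range(view.rows):
        for j in range(view.cols):
            view[i, j] = 10 * i + j


def total(view):
    """User code: sums all elements without knowing the layout."""
    return sum(view[i, j] for i in range(view.rows) for j in range(view.cols))


a = View2D(2, 3, row_major)
b = View2D(2, 3, col_major)
fill(a)
fill(b)

print(a.data)  # physical order under the row-major policy
print(b.data)  # physical order under the column-major policy
print(total(a) == total(b))  # the logical contents are identical
```

In a real system the layout policy would be chosen per memory space or per device rather than by hand, and choosing it well is the kind of decision the answer above suggests handing to the runtime.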

4. New memory technologies and configurations are becoming commercially available for performance-demanding applications. Can you give us an overview of what you think are the most important ones and their value to these applications?

High-bandwidth memory / stacked DRAM will be a game-changer in HPC, providing significant bandwidth improvements over conventional DRAM at lower energy cost. This will also affect the performance of diverse workloads – analytics and visualisation will surely benefit from 1 TB/s data access, though there is limited capacity and a latency consideration. DIMM-based non-volatile memory, as well as PCI-attached non-volatile memories, are very interesting to analytics and visualisation because of the vast capacity improvements they offer over DRAM, meaning that some large-dataset problems can be run in-core. Memory vendors are speaking about stacking NV RAM analogously to how we are now stacking DRAM, making capacities giant by today’s standards. The best way to understand how to configure and use this variety of memories for diverse workloads is co-design, and we are working with customers to jointly explore this.

5. With regard to Cray, does the company fundamentally see the Big Data opportunity as a way to expand beyond its HPC user base, or to expand its footprint within that base?

It is definitely both. Cray has a new and growing community of High-Performance Data Analytics customers. Likewise, highly configurable systems are of interest to HPC customers that are looking to expand the role of their centres or to deal with increased volumes of scientific data. Luckily for us, customers on all sides are really keen to work with us to solve research problems and to design future systems.