Power and Cooling: The Sword of Damocles?

By Steve Conway | September 08, 2015

Consistently ranked as the number two concern for HPC data centers, power and cooling face big unknowns

Fifteen years ago, power and cooling didn’t make the top 10 list of issues HPC data centers were facing. That changed quickly with the rise to dominance of clusters and other highly parallel computer architectures, starting in the period 2000 to 2001 and escalating from there. In IDC’s worldwide surveys since 2006, power and cooling have consistently ranked as the number two concern for HPC data centers, right behind the perennial quest for bigger budgets.

Spending Is Not Racing Ahead Yet

Despite this elevated concern, during the period 2006 to 2013, the portion of HPC budgets devoted to power and cooling held steady at between eight and nine percent on average. True, the average budget increased substantially during this period, so eight to nine percent amounted to more money in 2013 than in 2006, and more in 2006 than in 2000. But the absolute increases in spending on power and cooling pale in comparison with the explosive growth in the rated performance of HPC systems since 2000, especially supercomputer-class systems. Between April 2000 and September 2013, the average peak performance of systems on the TOP500 list skyrocketed from 154 GF to 652 TF, a factor of 4,217, while average TOP500 Linpack performance jumped 4,373-fold, from 102 GF to 446 TF.
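For readers who want to check the arithmetic, here is a minimal sketch that recomputes the growth factors from the rounded averages quoted above. The function names and the 13-year-5-month span are assumptions made for illustration; the small gap between the recomputed peak factor and the quoted 4,217 may simply reflect rounding in the published averages.

```python
import math


def growth_factor(start_gflops: float, end_gflops: float) -> float:
    """Ratio of end performance to start performance (same units)."""
    return end_gflops / start_gflops


def doubling_time_years(factor: float, span_years: float) -> float:
    """Implied doubling time, assuming steady exponential growth over the span."""
    return span_years * math.log(2) / math.log(factor)


# Rounded averages quoted in the article, converted to gigaflops.
PEAK_2000_GF = 154.0
PEAK_2013_GF = 652.0 * 1000       # 652 TF
LINPACK_2000_GF = 102.0
LINPACK_2013_GF = 446.0 * 1000    # 446 TF

# Assumption: April 2000 to September 2013 is about 13 years and 5 months.
SPAN_YEARS = 13 + 5 / 12

peak_factor = growth_factor(PEAK_2000_GF, PEAK_2013_GF)
linpack_factor = growth_factor(LINPACK_2000_GF, LINPACK_2013_GF)

print(f"Peak growth factor:    {peak_factor:,.0f}")     # about 4,234 from rounded inputs
print(f"Linpack growth factor: {linpack_factor:,.0f}")  # about 4,373
print(f"Implied Linpack doubling time: "
      f"{doubling_time_years(linpack_factor, SPAN_YEARS):.2f} years")
```

Under these assumptions, average performance doubled a little faster than once every 14 months, while the budget share for power and cooling stayed flat.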

Most Sites Are Managing Well

Rampant growth in HPC system sizes and processing power has not caused power and cooling budgets to spiral out of control yet, for two main reasons:

• HPC vendors have made their technologies and products substantially more power-efficient. The list of technologies contributing to improved power efficiency is long. It includes x86 and other base processors, along with GPGPUs and Intel Xeon Phi, both of which are notably energy efficient but have limited applicability today. Another important contributor is the trend toward liquid cooling, which is several times more energy-efficient than air cooling. Liquid cooling appears in many guises, from water-chilled doors to exotic immersive implementations.
• HPC sites have been on a tear updating their power and cooling infrastructures or building new ones. In a recent IDC worldwide study, two-thirds of the HPC sites had budgets in place to upgrade their power and cooling capabilities, to the average tune of about $7 million. Not surprisingly, government sites typically have the largest systems and are under the greatest energy and spatial pressure. Industrial sites are least constrained: half of them never see their energy bills, because someone else in the company receives and pays them.

By and large, HPC data centers, with the help of vendors, have been coping so far with rising requirements for power and cooling, even at an average cost of about $1 million per megawatt. The real question is what the future holds.

The Sword of Damocles?

As we approach the exascale era, is the power and cooling issue a disruptive sword of Damocles hanging over the HPC community, or will it continue to be manageable?

The study I cited earlier asked HPC users and vendors whether they expected any revolutionary advances in power and cooling technology in the next five years. The users said no, the vendors said yes, and they were referring in most cases to the same advances (because the vendors had briefed the users about their plans).

The users are less optimistic about breakthroughs, but most do not seem heavily concerned yet, except on one important point: potential tradeoffs between productivity and power efficiency. Tradeoffs frequently mentioned by users include pressure to overbuy coprocessors and accelerators that are energy-efficient but harder to program, and having to accept more service disruptions because shorter upgrade cycles are needed to deploy more energy-efficient systems.

But these tradeoffs seem manageable in the sense that they are based on decisions users will make. Even the vaunted goal of fitting an exascale computer into a 20 MW power envelope is a matter of time, of when rather than if. It might happen in 2020, 2024 or a different year, but it will happen.

The Bigger Unknowns

Much less certain is when deeper, more integral energy-efficiency capabilities will become available. These are considerably more challenging than squeezing a peak-exascale system into a 20 MW package, however difficult that may be. These deeper capabilities will go a long way toward making exascale and lesser extreme-scale computing a reasonable proposition for funders and users alike.

In particular, sophisticated power management (“power steering”) will be needed throughout the system to dynamically shift power to where it’s needed at every moment. Both hardware and software will need to be able to “learn” about power needs on the fly. Achieving this goal will require large investments to develop software that can power profile and power-steer many elements of the system, including:

• Cores and processors
• The interconnect and network interface
• The storage system
• The operating system, programming model and entire software stack
• Application codes (power-aware applications)
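To make the idea of power steering concrete, here is a minimal sketch of one possible allocation policy, assuming each element reports a current power demand and has a minimum "floor" it needs to stay operational. The function name `steer_power`, the component names, and the wattages are all hypothetical; a real system would rely on hardware telemetry and far more sophisticated, learned models.

```python
def steer_power(demands, floors, cap_watts):
    """Split a fixed power cap across components.

    Hypothetical policy: every component first gets its floor, then the
    remaining budget is shared in proportion to each component's unmet demand.
    """
    if sum(floors.values()) > cap_watts:
        raise ValueError("power cap is below the sum of component floors")

    # A component never receives less than its floor.
    wanted = {name: max(demands[name], floors[name]) for name in floors}

    # Everything fits under the cap: grant all requests as they are.
    if sum(wanted.values()) <= cap_watts:
        return {name: float(watts) for name, watts in wanted.items()}

    # Otherwise scale back the demand above each floor proportionally.
    remaining = cap_watts - sum(floors.values())
    excess = {name: wanted[name] - floors[name] for name in floors}
    total_excess = sum(excess.values())
    return {
        name: floors[name] + remaining * excess[name] / total_excess
        for name in floors
    }


# Illustrative numbers only (watts for one hypothetical node).
demands = {"cores": 800.0, "interconnect": 150.0, "storage": 150.0}
floors = {"cores": 200.0, "interconnect": 50.0, "storage": 50.0}

allocation = steer_power(demands, floors, cap_watts=700.0)
for name, watts in allocation.items():
    print(f"{name:12s} {watts:6.1f} W")
```

Rerunning the function whenever demands change is the simplest form of shifting power "to where it's needed." The harder problems the list points to, such as profiling every layer and making applications power-aware, sit outside this sketch.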

The HPC community is capable of developing these and other needed capabilities, given enough time, money and personnel. So, the real question, as with so many major HPC undertakings, is when will these necessary elements come together in sufficient quantity? If the past is any guide, a large chunk of the funding will need to come from government sources. And for that to happen, it has become increasingly clear that the HPC community will need to make a strong case for the returns government funders can anticipate from major HPC investments like this.

This discussion so far assumes that HPC data centers will have access to enough reliable energy, even in the exascale era. That is not a given. Today’s largest HPC systems already consume as much electricity as a small city, and their exascale successors promise to devour more, even with expected advances in energy efficiency. Some of the biggest HPC data centers worry that their local power companies may balk at fully supplying their future needs. A few sites have “plan B” scenarios in place, in which they go off the grid and build small nuclear reactors. And in some parts of the world, reliable access to adequate power is already a major challenge for HPC data centers. That is an important reminder that power and cooling are concerns not only for sites marching toward exascale capacity, but for most HPC sites.

Fellow Travelers

In their pursuit of energy efficiency at extreme scale, HPC sites will likely have fellow travelers in the form of major Internet players. A pattern is already forming in which these companies locate new data centers in geographical areas where power is comparatively cheap and plentiful. Google set the tone more than five years ago, by building a vast new data center along the Columbia River near Oregon’s Dalles Dam, with its 1.8-gigawatt power station and relatively inexpensive hydroelectric power. A prominent HPC example is Oak Ridge National Laboratory, whose power appetite is fed by the Tennessee Valley Authority. The lab’s data center hosts multiple petascale systems from DOE, NSF and NOAA. The biggest HPC and Internet data centers will likely have much to learn from each other in the coming years.