Verbatim: Jack Wells, Director of Science – ORNL LCF
February 12, 2014

In your opinion, how does Titan contribute to the innovation and competitiveness of the US?

Our partnership program contributes to innovation and competitiveness in several ways. First, our Leadership Computing Facility is a brain magnet: it attracts very smart scientists and engineers from around the world. We think adding brain power to our laboratory helps competitiveness, since the basis for any new innovation and any new idea is attracting top-notch talent. This strengthens our nation’s infrastructure. Second, that worldwide expertise, combined with our leadership computing systems, motivates research from around the world and brings the most challenging science and engineering problems to our center. Having first-class science and engineering research occur here is another way for our country to benefit.

China’s Tianhe-2 has retained its pole position in the latest Top500 list. How do you see China and the US competing for world leadership in supercomputing over, say, the next five years? Would you agree that such emulation is a good thing in this respect?

It is the mission of our Leadership Computing program at DOE, and the strategic intent of Oak Ridge National Laboratory, to compete for world leadership in computing and computational science and engineering. This has been our focus for more than 20 years, and our team is fully committed to it going forward. We aspire to be the world leader in this area, and this is the path we have been on.

Let me be clear: high-impact scientific achievement at large scale is our goal. Being high on the Top500 list is one measure of whether we have succeeded in procuring production supercomputers for our users at a scale large enough to have mission impact. It is one measure of mission success, but not the ultimate one.

Also, the broad common interest created by the regular communication around the Top500 list gives supercomputing centers such as ours a valuable platform for publicizing the impact of our projects. People are interested in what we’re doing on this complex and fascinating machine, and the Top500 list provides a good opportunity to get our message out.

To be more specific, supercomputing centers across China have made tremendous achievements, including the great leap to the number one spot on the Top500 list. It’s clear to me that they have the vision for a productive leadership role in supercomputing within their scientific and engineering enterprise, and I expect to hear about many research achievements from their supercomputing centers. Competition can be a very constructive and healthy thing, as it lets us gauge our progress towards our goals. We also expect to see increased cooperation, especially in application science areas. For example, China, the US and many other countries are partners in the ITER project, a large international effort to build a fusion reactor in France. This is a huge challenge. Computational plasma physicists and engineers from around the world are working on a variety of challenging problems associated with the eventual operation of ITER and the prediction and understanding of the phenomena observed there.

International fusion teams are running on Titan and on the Chinese supercomputer as well. This is very healthy. Running the same kinds of problems on different supercomputers is good, because we will learn a lot from these applications on cutting-edge, real-sized problems. We have a vision for the importance of computational science and engineering to our broader science and engineering enterprise. We believe that being first class in science means being first class in computing. And it seems to me that China shares that vision, so that’s good; it’s a confirmation that computing and computational science are important for the broader science and engineering enterprise.

Tianhe-2 is about twice as powerful as Titan. Who do you believe will reach exascale first – in a sustainable way? And, incidentally, do you see that happening in 2020?

To be objective, China’s big leap in the Top500 list is evidence to me that they are in the best position to reach exascale first. And practically, yes, this can happen by 2020. There are many problems that have to be solved but it’s a function of investment.

The next, pre-exascale OLCF-4 project is targeting 100-200 Pflops. What hardware and software challenges are you facing for this project?

There are two big hardware and software challenges here. The first is the scale of the system, in terms of the number of processing cores to be managed and the extreme parallelism that has to be identified and effectively expressed in the scientific codes. This is both a hardware and a software challenge, and it refers back to our earlier discussion of what the biggest challenge on Titan was. The second challenge is the resilience of the OLCF-4 system: how do we design hardware and software to be adaptable in the face of faults? With such a large number of components, how do we do that?
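One common software-side answer to that resilience question, offered here purely as an illustration and not as a statement of the OLCF-4 design, is application-level checkpoint/restart: the code periodically writes its own state to stable storage so that, after a fault, it can resume from the last checkpoint rather than start over. A minimal C sketch, with an illustrative file name, problem size and checkpoint interval:

```c
/* Minimal sketch of application-level checkpoint/restart (illustrative only). */
#include <stdio.h>
#include <stdlib.h>

#define N 1000000                 /* stand-in problem size (assumed) */
#define CHECKPOINT_EVERY 100      /* steps between checkpoints (assumed) */
#define CKPT_FILE "state.ckpt"    /* illustrative file name */

static int write_checkpoint(const double *state, int next_step)
{
    FILE *f = fopen(CKPT_FILE, "wb");
    if (!f) return -1;
    fwrite(&next_step, sizeof next_step, 1, f);   /* save where to resume */
    fwrite(state, sizeof *state, N, f);           /* save the field itself */
    fclose(f);
    return 0;
}

static int read_checkpoint(double *state, int *next_step)
{
    FILE *f = fopen(CKPT_FILE, "rb");
    if (!f) return -1;                            /* no checkpoint: cold start */
    if (fread(next_step, sizeof *next_step, 1, f) != 1 ||
        fread(state, sizeof *state, N, f) != (size_t)N) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return 0;
}

int main(void)
{
    double *state = calloc(N, sizeof *state);
    int start = 0;

    /* After a failure, resume from the last saved step instead of step 0. */
    if (read_checkpoint(state, &start) == 0)
        printf("restarting from step %d\n", start);

    for (int step = start; step < 1000; step++) {
        for (int i = 0; i < N; i++)               /* stand-in for the real solver */
            state[i] += 1.0;

        if ((step + 1) % CHECKPOINT_EVERY == 0)
            write_checkpoint(state, step + 1);
    }

    free(state);
    return 0;
}
```

At leadership scale the same idea is typically combined with parallel I/O and with tuning the checkpoint interval against the expected failure rate, but the basic control flow is essentially this.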

Is it going to be a GPU-based, hybrid architecture as well?

This is a big question. We have been working on it over the past year. The architecture of the OLCF-4 system will not be decided until sometime this year.

Is it going to be an upgrade of Titan or is it going to be a completely new machine?

This is going to be a completely new machine. We have reached the end of our road with the Jaguar and Titan infrastructure. We have a brand new room for preparing OLCF-4 and we will be building it while we operate Titan. And, I must add, it is an open competitive procurement.

Will it be a long term project like OLCF-3? Are you going to have the same vendor for the OLCF-4 and OLCF-5 projects?

That’s our hope. Our intent is to build another long-term partnership for the future. That has the advantage of providing some measure of consistency for users in the face of tremendous changes in the technology. There are aspects of the environment on Titan today that are consistent with what we had in 2004, 2005 and 2006. Right now, everything is poised to change. So much has happened in the HPC industry. It’s kind of a brand new day…

Over a 4-year time frame, we don’t really see how processor architectures will evolve, or which of the current architectural trends may take the lead. What is your vision on this hot topic, and how do you anticipate this evolution with regard to application portability?

This is a true statement. There are uncertainties in the details of future architectures. We believe certain trends will hold based on current hardware development. Perhaps the most important observation is that the complexity and heterogeneity of the nodes are increasing faster than the size of the machine in terms of the number of nodes. For software development, we first need to explicitly identify all the levels of parallelism that exist in our algorithms, and we then need to map those levels appropriately onto the available hardware architecture.
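As a minimal sketch of what that mapping can look like, assuming a trivial dot-product kernel rather than a real application code, the example below expresses three levels of parallelism explicitly: domain decomposition across nodes with MPI, threading across the cores of a node with OpenMP, and vectorization within a core via the OpenMP simd clause.

```c
/* Minimal sketch (illustrative, not an OLCF code): three explicit levels of
 * parallelism, each mapped onto a different layer of the hardware.
 * Build with, e.g.:  mpicc -fopenmp -O2 dot.c -o dot                        */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N_LOCAL (1 << 20)   /* elements owned by each MPI rank (assumed size) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double *x = malloc(N_LOCAL * sizeof *x);
    double *y = malloc(N_LOCAL * sizeof *y);
    for (int i = 0; i < N_LOCAL; i++) { x[i] = 1.0; y[i] = 2.0; }

    /* Level 1: domain decomposition across nodes/ranks (MPI).
     * Level 2: threads across the cores of one node (OpenMP parallel for).
     * Level 3: vector lanes within one core (OpenMP simd).                  */
    double local_dot = 0.0;
    #pragma omp parallel for simd reduction(+:local_dot)
    for (int i = 0; i < N_LOCAL; i++)
        local_dot += x[i] * y[i];

    /* Combine the per-rank partial results across the whole machine. */
    double global_dot = 0.0;
    MPI_Reduce(&local_dot, &global_dot, 1, MPI_DOUBLE, MPI_SUM, 0,
               MPI_COMM_WORLD);

    if (rank == 0)
        printf("dot product over %d ranks: %g\n", nranks, global_dot);

    free(x);
    free(y);
    MPI_Finalize();
    return 0;
}
```

On a hybrid machine such as Titan, the innermost level would typically be offloaded to the GPU (for instance with OpenACC or CUDA) rather than expressed as CPU vector lanes, but the exercise of naming each level of parallelism and assigning it to a layer of the hardware is the same.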

Do you see opportunities to further increase the parallelism of applications?

This is the question we asked in our most recent application requirements document: in the context of a 100 to 200 Pflops machine, where is the parallelism, and how can you express it for the problem you hope to solve on that machine? The majority of our users, though not all, responded that they had additional parallelism that could be expressed. Once we have identified the hardware we will have, we will really be able to focus on how to map that parallelism.

We all know energy efficiency is one of the main barriers to the next scale. Do you consider the programmability of millions of cores – whatever their nature – an equally difficult challenge to overcome?

Energy efficiency is primarily a hardware design challenge, but it also has significant consequences for the programmability of the machine. One of the key software design issues is that moving data is expensive in terms of energy. You might sometimes say that computing is free; it is, once you have the data in registers. Computing on data that is already on the processor is a very energy-efficient process. In the past, development focused on limiting data movement because of its cost in time; that concern now extends to its cost in energy. Going to millions of cores is in large part a data movement challenge: the movement has to be done fast and in an energy-efficient manner. In the future, I think we shall talk both in terms of time to solution and energy to solution.
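As a small illustration of that point, not code from the center, the two routines below perform exactly the same arithmetic; the fused version simply streams the array through memory once instead of twice, halving the data movement and therefore the dominant part of both the time and the energy cost.

```c
/* Illustrative only: same flops in both versions, different memory traffic. */
#include <stdio.h>
#include <stdlib.h>

/* Two passes: a[] is streamed through memory twice for the same arithmetic. */
void scale_then_shift_two_passes(double *a, size_t n, double s, double c)
{
    for (size_t i = 0; i < n; i++) a[i] *= s;   /* pass 1: read and write a[] */
    for (size_t i = 0; i < n; i++) a[i] += c;   /* pass 2: read and write a[] again */
}

/* One fused pass: each element is loaded once, both operations are applied
 * while it sits in a register, and it is stored once.                       */
void scale_then_shift_fused(double *a, size_t n, double s, double c)
{
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] * s + c;
}

int main(void)
{
    size_t n = 1 << 24;                         /* illustrative problem size */
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    for (size_t i = 0; i < n; i++) a[i] = b[i] = 1.0;

    scale_then_shift_two_passes(a, n, 2.0, 1.0);
    scale_then_shift_fused(b, n, 2.0, 1.0);

    printf("a[0] = %g, b[0] = %g (identical results)\n", a[0], b[0]);
    free(a);
    free(b);
    return 0;
}
```

The same reasoning, applied at larger scale, is what motivates techniques such as cache blocking, kernel fusion and communication-avoiding algorithms.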

Do you think supercomputing will scale up based on today’s technologies or do you expect decisive technology breakthroughs? If so, which ones?

This really depends on the meaning of the term: what is “a technology breakthrough”? It is hard to have a common understanding of that. No breakthrough technologies are required to build a pre-exascale system in 2017 and exascale systems in 2022, but there is a lot of research and engineering required over the next eight years to deal with the power consumption, resilience and programmability challenges at scale. Whatever your understanding of “breakthrough”, the challenges become more severe as you try to pull those dates in. Now take the OLCF-4 project: in that program, we are already two major steps down the road. No world-changing breakthrough has to be achieved for us to deliver a machine in 2017…
