On the agenda:
• ORNL, behind the scenes
• The implications of adopting GPU acceleration
• The roadmap for the successors of Titan
• The benefits of the US-China competition for supercomputing leadership
Interview by Stéphane BIHAN
Mr Wells, can you briefly describe what Oak Ridge National Lab is and tell us about your role in this organization?
Oak Ridge National Laboratory is the US DoE’s largest science laboratory, with world-class capabilities in materials science and engineering, nuclear science and technology, neutron scattering technologies, and computing and computational sciences. Our Computing and Computational Sciences directorate is one of the five science directorates at the Laboratory, and each directorate has approximately 600 employees. In all, ORNL counts about 3,300 staff.
The lab grew out of the Manhattan Project during World War II. The vision that emerged after the war lay at the intersection of materials science and technology, and nuclear science and technology. Within the last fifteen years, the lab has really grown its capabilities in neutron science and technology, and in computational science and technology. The mission of the Oak Ridge Leadership Computing Facility (OLCF) is to procure and operate the most capable production supercomputing facility we can, and to make these capabilities available for high-impact research by US industry, universities and other federal agencies. In addition to our user facility, a second one, funded by the National Science Foundation and led by our partner, the University of Tennessee, is located here. There are also two research divisions: one is the Computer Science and Mathematics Division and the second is the Computer Science and Engineering Division.
Today, in the OLCF, we operate Titan, the number 2 supercomputer in the world on the Top500 list. We operate it for users, and access is granted through open, competitive, peer-reviewed allocation programs. As Director of Science, my role is to engage the user community, ensure high-quality, cost-effective science outcomes, manage the allocation programs and user policies, collect user requirements for future procurements, and ensure the effective communication of research results.
Titan has been in operation since the beginning of 2013. How would you describe the process of building such a giant machine? How long did it take from the initial sketches to end-user delivery?
The process of building a supercomputer is actually a big part of what we do. In January 2009, Ray Orbach, then Director of the DoE’s Office of Science in Washington (that is, the lead of all the science investments in DoE), signed a mission need statement for a 10 to 20 Pflops upgrade of the OLCF system. The January 2009 date is really when the OLCF-3 project, i.e. the Titan project, started. In December 2009, the next step was the approval of the Acquisition Strategy and Cost Range. We proposed a set of alternatives to accomplish the mission need and we then began to make progress on meeting it. This all fits within a very structured project management process that had to undergo peer-review and management approval. The next step came in August 2011 when the OLCF received the approval from DoE to order the system that was selected, a Cray XK7.
Did you talk with other vendors?
The path that was chosen for the OLCF-3 procurement was actually an upgrade of Jaguar rather than a brand new machine. That is why it became a sole source procurement.
What was the overall budget for building this flagship? Can you also give us an estimate (and breakdown) of its operation costs?
It is on the order of a hundred million dollars. There are three main elements in the operating budget. One is the lease for the supercomputer: we don’t purchase the machine, we amortize its cost over a period of 4 or 5 years through a lease, so the cost of the machine becomes part of the operating budget. The other two elements are the cost of electricity and the cost of our staff.
What is the power consumption of Titan?
If you look at the amount of power Titan consumes when it is running compute-intensive applications like the Linpack benchmark, it is about 8.2 MW, as indicated in the Top500 list. Cooling adds roughly another 20%, which brings Titan proper close to 10 MW. But our computing facility has a total power supply of 25 MW, as we also run other computers and petascale systems: Kraken, which we operate for the National Science Foundation, another petascale machine we operate for NOAA, and laboratory-scale servers, workstations and clusters.
Why did you choose a hybrid, GPU-based architecture? Was NVIDIA influential in ORNL’s choice?
This is actually the result of our application requirements process. From discussions, interviews and surveys with our user community back in 2009, we determined that the top requirement for a new system or a system upgrade was to deliver more raw flops in order to perform the needed science simulations. More raw flops on the node ended up being the number one requirement, although some codes were starting to reach the limits of their ability to simply scale out in a flat MPI space. Therefore, rather than just adding more nodes or more cores as we grew the size of the Jaguar machine, we decided to make the nodes more powerful and flop-rich. Then, from our early discussions with NVIDIA, we became convinced that GPUs would be able to satisfy that requirement for many of our applications.
The community thinking at the time, and I think it is even stronger today, four years on, was that in order to reach exascale, the nodes would have to be more heterogeneous. That seems to be a strong trend. We thought that this was what the future would likely be, and we decided that it would be a good role for a leadership computing facility to step out and take that on.
Did you see any risk in adopting GPUs?
Yes, the main risk we identified and worked to manage was that our codes would not be able to run on that machine and effectively use the GPUs.
So… was porting some codes part of the evaluation process?
Actually, we had an application readiness review where our strategy for getting the codes ready to run was peer-reviewed. If we had not passed that review, we probably would not have been able to proceed with the project. But we did: we were able to convince the peer-review panel that we had a good strategy for getting the codes ready.
Which technical strategies have been adopted by ORNL to migrate applications from CPU parallelism to GPU acceleration?
This is an important point. As I said, early in the project we had to present our strategy. A huge part of that strategy on the application side involved forming a Center for Accelerated Application Readiness (CAAR). It was created as part of the Titan project as a way to manage this risk, that is, to manage the task of having some user codes ready to go on Titan. Through CAAR, we focused on a number of application codes from our user programs. We tried to select them for their diversity, so that their algorithmic requirements would represent a good variety of computational science problems. The codes selected included LAMMPS, a molecular dynamics code developed outside the national lab; S3D, a combustion code; LSMS, a materials science code from Oak Ridge; Denovo, a neutron radiation transport code for nuclear reactors being developed here; and CAM-SE, the Community Atmosphere Model with Spectral Elements, which is part of a big climate code.
How was CAAR structured?
There was a team assigned to each application that included an application lead from the OLCF application group, a Cray engineer, an NVIDIA development engineer and other application developers. If we needed someone from another university or another laboratory, we would bring them in on contract. We formed small teams to port the code for each application. On average, it took one to two person-years of work to take a code that was running on Jaguar and get it ready to go on Titan.
Now, we did not take on the task of porting each code in its entirety, because many of these codes are community codes with a lot of different components used for a lot of different problems. What we did was select, for each code, a science problem that would be both manageable and attractive. That is an important concept when you are doing something like this: bring focus, so that you can succeed within a finite amount of time and a finite budget. Of course, along with the benchmarks, these codes were also part of the acceptance suite for the machine.
Throughout that process, there were lessons to learn. One is that getting a code ready for a specific accelerator architecture requires major code restructuring. This restructuring took 70 to 80% of the effort. It has nothing to do with any specific programming model for offloading to the GPU: if it had been an Intel MIC processor, we believe the same 80% of the work would have been necessary. It has to do with preparing the code to work well in the context of this hierarchical parallelism.
In other words, you modified the algorithms rather than the science?
That’s right, and this was related to how memory is structured in the machine. For instance, some arrays in the code had to be reordered, arrays of lists had to become lists of arrays, and other changes of that kind, so that when you get to the bottom of the memory hierarchy, you have a chunk of well-organized work to hand off to the GPU accelerator.
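To make the kind of reordering described here concrete, the sketch below (illustrative only, not taken from any CAAR code; the names `particle_aos` and `particles_soa` are invented) shows the classic transformation from an interleaved layout to one where each field is a contiguous array, which lets GPU lanes reading consecutive elements access memory efficiently:

```c
#include <stddef.h>

#define N 1024

/* Interleaved layout: each element's fields sit next to each other,
 * so a loop over a single field strides through memory. */
struct particle_aos { double x, y, z; };

/* Reordered layout: each field is one contiguous array, so a loop
 * over one field (on CPU or GPU) touches consecutive addresses. */
struct particles_soa { double x[N], y[N], z[N]; };

/* Reorder the data: the kind of restructuring the interview describes. */
static void aos_to_soa(const struct particle_aos *in, struct particles_soa *out)
{
    for (size_t i = 0; i < N; i++) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
    }
}
```

The computation itself is unchanged; only the memory layout moves, which is why, as the interview notes, this work pays off regardless of the programming model used afterwards.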
This really takes a combination of expertise: science, numerical engineering, processor architecture, and so on. We started doing that work on Jaguar, the old Cray XT5, and almost every one of the codes ran two times faster. You see, working on code is good. Thinking hard about the way codes will run on this big machine is good. Sometimes codes run in production mode for a while and no one takes a hard look at them, so that was a good outcome. We also believe that this restructuring positions the codes well for the future, and this is why you need to do it whether you use CUDA, OpenCL or OpenACC. As I said earlier, even if you later move to a MIC-accelerator-based machine or something else, this kind of restructuring is really important. It is going to be essential for the next generation of machines.
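Once the data is laid out as well-organized contiguous chunks, the offload step itself can be very thin, which is the point being made about CUDA, OpenCL and OpenACC above. A minimal, hypothetical OpenACC sketch (the function `scale` is invented for illustration):

```c
#include <stddef.h>

/* Scale a contiguous array, offloading the loop where possible.
 * Built with an OpenACC compiler (e.g. `nvc -acc` or `gcc -fopenacc`),
 * the loop runs on the accelerator; built without one, the pragma is
 * ignored and the same loop runs serially on the CPU. */
void scale(double *a, size_t n, double factor)
{
    #pragma acc parallel loop copy(a[0:n])
    for (size_t i = 0; i < n; i++)
        a[i] *= factor;
}
```

Because the directive only works well on flat, contiguous data, the 70-80% restructuring effort described earlier is what makes this last step simple, whichever offload model is chosen.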
© HPC Today 2019 - All rights reserved.