Verbatim: Dan Stanzione, Acting Director – TACC
June 03, 2014

Stampede, operated by TACC, is ranked number 7 in the latest Top500 listing. It’s also one of the most powerful Xeon Phi-based machines. Why did TACC choose such an architecture?

We looked at Stampede as being a sort of comprehensive system. The choice was partially made in response to the NSF’s Request for Proposals (RFP), which was divided into two parts. One was a base system that could widely support the needs of the US HPC user community. The other was a system that would allow for experimentation. We chose the Xeon Phi because we knew we would have to replace two systems with different user support requirements: our Ranger system (already planned for decommissioning) and the Kraken system at the National Institute for Computational Sciences. Another reason was that we took advantage of separate funding to build a machine that would not be a conservative system, something we knew would not necessarily work for all users on day one but would help us experiment along the way.

Stampede is really a combination of two different subsystems. The first is a large-scale architecture similar to Ranger (a familiar environment for most of our users), consisting of 100,000 traditional Xeon cores delivering approximately 2.2 Pflops. The second is an accelerated subsystem of 7.4 Pflops, which gave us capabilities to experiment with that we didn’t have in Ranger. We also integrated a large-memory subsystem with nodes containing 1 TB of RAM, because that was a user need.


We also integrated GPUs for doing visualization inside the system, without having to move the data out and then back in. Obviously, we also put a lot of effort into creating a good file system, because some applications are driven by I/O.

We felt that was a comprehensive system, but we also knew that the central issue driving a lot of High Performance Computing systems today is the total power required. We needed something that would give us more processing capability and more power efficiency. We looked at a lot of options – very closely at GPUs, at some FPGA-based solutions, and at a number of other things. We really felt that the Xeon Phi was the most promising given the available time and the funding level. One reason is that the Xeon Phi gives us the same change in the power curve that the GPU does, a more power-efficient model of computing moving forward. But another reason is that we had to preserve the programming model for a large number of our users. Certainly there is work to be done to use the Xeon Phi effectively, just as there is for GPUs, but the languages and the programming tools – we primarily use OpenMP on the Phi – are very similar to what we are pushing our user community to use anyway. That was a sort of natural progression: a base system on Xeon that we knew they could use from day one, and then the Xeon Phi subsystem to help them migrate onto future systems.

How many nodes have a Xeon Phi attached?

We actually put a Xeon Phi in every one of the 6,400 nodes of the base system, and we even added a few more because we are experimenting with configurations that put multiple Xeon Phis on the same node. So we have 440 nodes with two Xeon Phis attached. Altogether this means 6,840 Xeon Phis in production and a few in reserve.

In this regard, from experience, what are the pros and cons of using the Xeon Phi? And, from an alternative standpoint, what made you prefer them to GPU accelerators?

The pro is obviously the potential to get great performance. We have a few applications where the performance is just staggering on these new chips, given the scale of the system. How much performance you can deliver is obviously the upside. The other real pro is that they preserve the traditional programming model – it’s an OpenMP/MPI sort of programming model. The fact that it’s easy to get started is certainly a big pro for our users. You can move code over and have it running very quickly.

The con is that coprocessors are a lot more sensitive to the quality of the code. The code has to vectorize effectively to run well on the Xeon Phi. If you are running in native mode, where you treat the card as its own node, the con then becomes the limited amount of RAM available on the card. It sits as a PCI-based accelerator with about 8 GB of RAM for 61 cores, which is a ratio people are not used to. But RAM is limited on the GPU as well: as long as you’re sitting in this PCI package, you are using the more expensive high-bandwidth RAM, and you have less of it per floating-point unit than you would on a traditional CPU.

Because at the moment they are PCI accelerators, the other con is that you have to get across the PCI bus. That makes networking a little more complex, and the file system is a little more complicated for us to set up. To ease this, we mount the file systems directly on the cards, but you don’t get exactly the same performance as on the host node. There is a bit of an imbalance there. The other option is to use offload mode to avoid the problems of limited RAM and different accesses to the file systems. In that mode, part of the code runs on the host and part runs on the Phi, which is actually the same as with GPUs. Of course the drawback is the work it takes to get an offload program to work correctly. You have to think about the transfers of data back and forth between the host and the card, and you have to deal with partitioning your code between the part that runs on the host and the part that runs on the card. So these are the cons with the Xeon Phi and the GPUs at the moment: you have to think about the offload, you have limited RAM, and you are sitting on the wrong side of the PCI bus in the current packaging.

Now, all that being said, one of the reasons we went with the Xeon Phi over GPUs – right now GPUs have higher peak performance than the Phi – is that we preserve more of the programming model, and this gives our users a head start on the Xeon Phi over the first couple of years. We probably had more codes that were ready for GPUs on day one, just because GPUs have been out longer than the Xeon Phi, but over time we thought we could get a larger slice of the users over to the Xeon Phi as the ecosystem matures.

In your opinion, what kind of improvements would lead to a more efficient use of the Xeon Phi? In other words, what barriers would you expect to break down with the next generations?

We are in a sort of transition state with all these hybrid systems. For all of them, the PCI-based accelerator was the way to get started and to deploy quickly, but I believe that in the next several years (and all major vendors have already announced this), whether you are talking about the AMD Fusion chip, the next generation of Xeon Phi from Intel or the next generation of GPUs from NVIDIA, everyone is moving to a model where the accelerator becomes the primary processor on the host, or at least is integrated on the same die.

The age of having a little PCI card with a relatively small amount of RAM attached to the main host will go away in the next couple generations of systems. This will remove the barriers of offloading back and forth, the limited RAM and the file system. All of those things will go away as we move to a sort of cleaner, not hybrid, design.

But the question is going to be: where do we get the most energy efficiency, and where do we want to make the trade-off between software complexity and having the architecture take care of it? What is clear is that with any of these processors, including the traditional ones, we will see more cores and longer vector lengths. We’ll be able to do a finer-grained comparison once everything is self-hosted, when you can boot either a Xeon Phi node, a Xeon node, a GPU node, an AMD Fusion node, the new Micron Array processor node or whatever there happens to be. You will not have to deal with these artifacts of hybrid systems that slow things down. I would encourage everybody to start thinking about more cores, more threads, and longer vectors in the code we are writing today, because all of our systems are going to look like that, probably in the next two or three years.

Do you think it’s possible to extract that much parallelism from the application?

In many cases, yes. It is certainly not trivial to do, and certainly not every application will be able to scale to half a million or a million threads. But if you look at the history of parallel computing, going back several decades to when clusters really started, when it was big silicon versus parallel processing – the Cray vector machine versus the Cray massively parallel processor – the same debates took place about applications being mostly serial, with just a few that could be highly parallel. And, of course, the last twenty-five years of clustering showed that there was a lot of parallelism. We could not just vectorize, we could also multithread. Every time we have a big technology shift, we say it is going to apply to just a small set of applications. Over the years, it turns out that people are clever and find new ways to do it with a very large set of applications.

The point is that we are now doing vectors again with the Xeon Phi. People have forgotten that in the eighties, most codes were supposed to be vectorized and not very parallel. Now most of these codes are parallel, but they are not very vectorizable. I don’t think the fundamental mathematics have changed a lot in the intervening time. It is true there are limits to the parallelization of smaller problems. If you are doing a genome assembly, for instance, there are only so many fragments to assemble; you may not need a million processors for that, but you could probably use hundreds or thousands. Million-core machines are just about here now, and we need a million threads to program most of the top ten machines.

Maybe not every application can get to that million-thread scale, but most of them can get to a hundred-thousand-thread scale on these highly parallel nodes over time. It is surely not going to happen without a lot of work, and there will always be applications that lag behind, but I certainly think it’s doable.

Today, hybrid architectures (CPUs + GPUs) seem to be the toast of the community, for a number of reasons including universality and energy efficiency. What’s your take on that?

Energy efficiency is the driver that pushed us to these hybrid models, but to me, the current definition of hybrid, where you have a main processor and a PCI-based accelerator, is temporary. It’s a way for us to explore these new energy-efficiency technologies, and fortunately we are doing a lot of useful work with this exploration. It’s a multi-year period, and in the end the hybrid notion of this kind of architecture will go away quite quickly. I would be surprised if in 2017 or 2018 any large-scale systems being deployed are hybrid in this sense.

I think the hybrid notion will still exist, but in the silicon sense. There will be GPU-type arrays in silicon with some processing elements on the front, or maybe a mix of high-speed cores and lower-speed highly parallel cores on the same die. But there will be one kind of silicon in the machine, one kind of node building up the cluster. We won’t have this very complicated heterogeneity that we have now and all the imbalances that come with it. As I said, two or three years from now that phase will have passed, and we will be using more parallel sockets with more concurrency and less complexity to get the power down, which is really what both the Xeon Phi and GPUs are about. The code has to be good, because all of the stuff we do in processors to protect bad code from itself has to go away to get the transistor count down. Power will still drive us, but it won’t drive us to separate PCI cards and two kinds of processors for very much longer.

Do you think processors like the AMD Fusion or the NVIDIA Tegra, even though they are dedicated to the mobile market, could also be used in HPC systems?

I think it is the goal for AMD and NVIDIA to move that way, and Intel has announced that the next generation of Xeon Phi will also be self-hosted. With all of them, we will see this move towards putting everything on a single chip and making the nodes cheap in the HPC market, with very different performance and power curves than traditional processors. The good news here is that all the work we are doing on code – at least a lot of the work to extract many, many threads, to extract long vectors and to find the parallelism – is going to work very well on those future machines. The pain of porting to GPUs or the Xeon Phi will pay off, because most of that work is going to be preserved in any of these future systems.

© HPC Today 2024 - All rights reserved.
