Deep Learning Benchmarks Reflect the Success of Intel Scalable System Framework
By Rob Farber  |  September 15, 2016

To help others perform fair benchmarks and realize the benefits of multi- and many-core performance, Intel recently announced several optimized libraries for deep and machine learning, such as the high-level Intel Data Analytics Acceleration Library (Intel DAAL) and the lower-level Intel Math Kernel Library for Deep Neural Networks (Intel MKL-DNN), which provides optimized deep-learning primitives. The Intel MKL-DNN announcement also noted that the library is open source, unrestricted, and royalty free. Even the well-established Intel Math Kernel Library (Intel MKL) is getting a machine-learning refresh with the addition of optimized primitives to speed machine and deep learning on Intel architectures. More about these libraries can be seen in the Faster Machine Learning and Data Analytics Using Intel Performance Libraries video.
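To make the idea of an "optimized primitive" concrete, the minimal sketch below expresses the forward pass of a fully connected layer as a single SGEMM call. This is an illustrative example, not code from Intel's announcements: the dense_forward function and its dimensions are hypothetical, and only the standard cblas_sgemm interface (which Intel MKL implements in highly tuned form) is assumed.

```c
#include <mkl.h>    /* Intel MKL's CBLAS interface; any CBLAS would also work */
#include <string.h>

/* Hypothetical dense-layer forward pass: Y = X * W^T + b for a whole batch.
 * X is batch x in_f, W is out_f x in_f, b is out_f, Y is batch x out_f,
 * all stored row-major. The entire layer reduces to one tuned SGEMM. */
void dense_forward(const float *X, const float *W, const float *b,
                   float *Y, int batch, int in_f, int out_f)
{
    /* Seed each output row with the bias so beta = 1 accumulates onto it */
    for (int i = 0; i < batch; ++i)
        memcpy(Y + (size_t)i * out_f, b, (size_t)out_f * sizeof(float));

    /* Y += X * W^T with M = batch, N = out_f, K = in_f */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                batch, out_f, in_f,
                1.0f, X, in_f,
                W, in_f,
                1.0f, Y, out_f);
}
```

Because the layer collapses to a single BLAS call, the library's architecture-specific tuning (vectorization, cache blocking, threading) is what determines training throughput on Intel Xeon and Intel Xeon Phi processors.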

Big data stresses both network and storage
Big data is key to accurately training neural networks to solve complex problems, as large amounts of data are required to accurately describe the problem to the neural network. How Neural Networks Work, for example, shows that a neural network is actually fitting a 'bumpy' multi-dimensional surface, which means the training data needs to specify the hills and valleys, or points of inflection, on that surface. Not surprisingly, the preprocessing of the training data, especially unstructured data, can be as complex a computational problem as the training itself.

Parallel distributed computing (illustrated by the mapping in Figure 3) is a necessity for machine learning, as even the TF/s parallelism of a single Intel Xeon or Intel Xeon Phi processor-based workstation is simply not sufficient to train accurately, in a reasonable time, on many complex data sets. Instead, numerous computational nodes must be connected together via a high-performance, low-latency communications fabric such as Intel Omni-Path Architecture (Intel OPA) and programmed through a communications library such as the Intel MPI Library.

Intel Omni-Path Architecture
For data transport, the Intel OPA specifications hold exciting implications for machine-learning applications, as the fabric promises to speed the training of distributed machine-learning algorithms through: (a) a 4.6x improvement in small-message throughput over the previous-generation fabric technology; (b) a 65 ns decrease in switch latency (think how those latencies add up across all the switches in a big network); and (c) 100 Gb/s of network bandwidth [6], which speeds the broadcast of millions of deep-learning network parameters to all the nodes in the computational cluster (or cloud) and minimizes startup time when loading large training data sets.
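A rough back-of-the-envelope calculation shows why that bandwidth matters (the model size here is a hypothetical illustration, not a figure from the specifications): a network with 50 million single-precision parameters occupies 50M × 4 B = 200 MB, or 1.6 Gb. At a 100 Gb/s line rate, a single transfer of those parameters takes on the order of 1.6 Gb ÷ 100 Gb/s = 16 ms, ignoring protocol overhead. Since distributed training repeats this synchronization every iteration, fabric bandwidth directly bounds how often the cluster can update the model.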

The Intel MPI library
MPI is a key communications layer for many scientific and commercial applications, including machine- and deep-learning applications. In general, all distributed communications pass through the MPI API (Application Programming Interface), which means that both standards compliance and performance at scale are critical.

The Intel MPI library provides programmers a "drop-in" MPICH-compatible replacement library that can deliver the performance benefits of the Intel OPA communications fabric plus high-core-count Intel Xeon and Intel Xeon Phi processors. Tests have verified the scalability of the Intel MPI implementation to 340,000 MPI ranks [7], where a rank is a separate MPI process that can run on a single core or an individual system. Other communications fabrics such as InfiniBand are supported as well, and programmers using non-MPICH libraries can recompile their applications to use the Intel MPI library.

As shown in Figure 3, the global broadcast of parameters to the computational nodes is a performance-critical operation. The following graph shows how the Intel MPI team has achieved an 18.24x improvement over OpenMPI.
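While the graph compares library implementations, the operation it measures is simple to express in code. The sketch below is illustrative only, not the benchmark code: the parameter count N is a hypothetical stand-in for the millions of deep-learning network weights mentioned above, and only the standard MPI_Bcast collective is assumed.

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 10 * 1000 * 1000;           /* hypothetical parameter count */
    float *params = malloc(N * sizeof *params);

    if (rank == 0) {
        /* Root rank holds the current model parameters */
        for (int i = 0; i < N; ++i) params[i] = 0.0f;
    }

    /* One collective call pushes the parameters from rank 0 to every node;
     * this is the broadcast pattern whose performance the graph compares. */
    MPI_Bcast(params, N, MPI_FLOAT, 0, MPI_COMM_WORLD);

    free(params);
    MPI_Finalize();
    return 0;
}
```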

Machine learning is but one example of a tightly coupled distributed computation in which the small-message traffic generated by a distributed network-reduction operation can have a big impact on application performance. The 1.34x performance improvement shown below translates to a significant time-to-model improvement simply by "dropping in" the Intel MPI library for MPICH-compatible binaries (or recompiling to transition from non-MPICH libraries like OpenMPI). A minimal sketch of such a reduction follows.
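In a data-parallel training run, each rank computes a partial gradient from its shard of the training data and the ranks combine them with a single collective. The sketch below is illustrative, with hypothetical names (allreduce_gradients, grad, n); only the standard MPI_Allreduce collective is assumed.

```c
#include <mpi.h>

/* Sum partial gradients across all ranks in place; after the call, every
 * rank holds the global gradient and can apply the same parameter update. */
void allreduce_gradients(float *grad, int n)
{
    MPI_Allreduce(MPI_IN_PLACE, grad, n, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
}
```

The messages exchanged inside such a reduction are small relative to the full model, which is why small-message throughput and switch latency, rather than raw bandwidth alone, dominate its performance.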

Such reduction operations are common in HPC codes, which is one reason organizations spend large amounts of money on the communications fabric; it can account for up to 30% of the cost of a new machine [8]. Increased scalability and performance at a lower price point explain the importance of Intel OPA to the HPC and machine-learning communities, as well as to the cloud computing community.

The Lustre filesystem for storage
Succinctly, machine learning and other data-intensive HPC workloads cannot scale unless the storage filesystem scales to meet the increased demand for data. This includes the heavy demands imposed by data preprocessing for machine learning (as well as for other HPC problems) and the fast loading of large data sets during restart operations. These requirements make Lustre – the de facto standard high-performance parallel filesystem – a core component of any machine-learning framework. The Intel Enterprise Edition for Lustre software, backed by expert support, brings the power and scalability of Lustre to the enterprise.

Summary
Intel SSF is designed to help the HPC community identify the right combinations of technology for machine learning and other HPC applications. Recent benchmarks and customer success stories reflect the success of the Intel SSF approach.

About the Author
Rob Farber is a global technology consultant and author with an extensive background in HPC and in developing machine learning technology that he applies at national labs and commercial organizations. He can be reached at info@techenablement.com.

[1] "HPC Balance and Common Sense".
[2] "Using Intel's Xeon Phi for Brain Research Visualization".
[3] "Contrary View: CPUs Sometimes Best for Big Data Visualization".
[4] Farber, "Efficiently Modeling Neural Networks on Massively Parallel Computers", NASA Workshop on Parallel Computing, November 1991.
[5] Vector peak performance: 3+ TF/s double precision, 6+ TF/s single precision; scalar performance: ~3x over Knights Corner; STREAM Triad bandwidth: MCDRAM 400+ GB/s, DDR 90+ GB/s. See slide 4.
[6] https://ramcloud.atlassian.net/wiki/download/attachments/22478857/20150902-IntelOmniPath-WhitePaper_2015-08-26-Intel-OPA-FINAL.pdf.
[7] See https://software.intel.com/en-us/intel-mpi-library/details.
[8] See Figure 1 in "Transforming the Economics of HPC Fabrics with Intel Omni-Path Architecture".
