Bringing Lustre Relevance to the Enterprise

By Ken Strandberg | August 14, 2015

High Performance Computing (HPC) technologies are coming to the enterprise to help with Big Data as well as emerging enterprise HPC workloads. Common to these workloads are exponentially expanding data requirements, so enterprises face one of their biggest challenges: efficiently serving up those massive amounts of data to the compute complex. Company IT departments are experiencing bottlenecks in storage I/O, network I/O, or both with their existing NFS and local file systems. The file systems need to perform better to keep up with workload demands, but scaling NFS systems is difficult.

Managing a large number of NFS servers is complex: the namespace has to be broken up across servers, which can result in isolated storage domains. “One company in the Financial Services Industry,” said Brent Gorda, General Manager of Intel’s High Performance Data Division, the group that develops Intel Lustre* software, “has a hundred NFS servers, and users have to first remember on which server their data resides before they can even start processing it.” Some organizations attempt to mask this complexity with a tangle of automount maps; this normalizes the namespace to a certain degree, but does not solve the problem of unequal distribution of data and I/O workloads across the storage domain. The logical solution is a clustered approach with a parallel file system, and Lustre is the leading parallel file system in performance and scalability. So enterprises are looking at Lustre, even though it has had a reputation of not being very enterprise-friendly.

Equipping Lustre for the Enterprise
“Lustre is an open source project, but Intel is a driving force behind it, having led every community release since 2010, starting with Whamcloud,” stated Bret Costelow, Director of Global Sales for Lustre Solutions at Intel. “Each year, Intel developers continue to spend significant work investing in specific features that make Lustre more relevant and appealing in enterprise workloads.”

“Big Data and HPC storage in the enterprise is not the same as in academia and the big labs,” according to Malcolm Cowe, product manager for Intel Enterprise Edition for Lustre software. “In the labs, there might be six or more full-time engineers managing their Lustre file system. Enterprises can’t afford that level of dedicated resources.” In the enterprise, file systems have to be easily managed, reliable, and available with minimal attention, and enterprise IT leans heavily on automation across its infrastructure. “Storage systems need to run without interruption and intervention as much as possible.” To address these enterprise requirements, Intel has worked both on Lustre itself and on the tooling and ecosystem around it.

There is more to standing up an enterprise-class Lustre file system with 50 petabytes of capacity, 2 terabytes per second of throughput, and 50,000 drives than just acquiring the servers and installing the open source software. For mission-critical workloads, reliability, availability, and even disaster recovery need to be built into the solution, so there is no downtime. For example, Australia’s Bureau of Meteorology relies on data being always accessible for numerical weather prediction to assist air traffic, fire suppression planning, and weather emergencies, where lives can depend on its availability 24/7.
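To give a feel for the scale of the system described above, the following back-of-the-envelope sketch divides the quoted totals evenly across the drives. It is only an illustration: it assumes decimal units, treats the 50 PB and 2 TB/s figures as raw totals, and ignores RAID overhead, spares, and server or network limits, none of which the article specifies.

```python
# Rough sizing arithmetic for the example system quoted in the article.
# Assumption: decimal units and an even spread across all drives; RAID
# overhead, hot spares, and network limits are deliberately ignored.

PB = 10**15
TB = 10**12
MB = 10**6

capacity_bytes = 50 * PB          # quoted capacity
throughput_bytes_per_s = 2 * TB   # quoted aggregate throughput
drives = 50_000                   # quoted drive count

# Share of capacity and bandwidth each drive would carry on average.
capacity_per_drive_tb = capacity_bytes / drives / TB
throughput_per_drive_mb = throughput_bytes_per_s / drives / MB

# Time to stream the entire file system once at full aggregate speed.
full_scan_hours = capacity_bytes / throughput_bytes_per_s / 3600

if __name__ == "__main__":
    print(f"Capacity per drive:   {capacity_per_drive_tb:.1f} TB")
    print(f"Throughput per drive: {throughput_per_drive_mb:.1f} MB/s")
    print(f"Full scan time:       {full_scan_hours:.2f} hours")
```

Under these assumptions each drive holds about 1 TB and contributes about 40 MB/s, which suggests why the aggregate figure depends on keeping tens of thousands of drives busy in parallel.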

While Intel and the Lustre community have created highly stable software, enterprise-class reliability is largely based on investment in system hardware. Intel Lustre solution architects work with a large partner ecosystem to develop high availability building block modules based on established design patterns, so there’s no single point of failure in the server infrastructure. The partners then deliver the solution to their customers. “Investing in and working with a partner ecosystem and providing a credible support infrastructure through that partnership is a critical part of our reliability and serviceability model,” remarked Costelow.

Polishing Lustre’s Undeserved Reputation
“Lustre has an undeserved reputation for being difficult,” stated Cowe. “It comes from an open source history, where the focus is on those writing the code rather than those consuming it. So, we have made and continue to make Lustre as accessible as possible with features like Intel Manager for Lustre (IML).” IML is a web-based interface that makes the system straightforward to install and easy to manage. The software applies IT industry best practices, based on established hardware system design patterns, and automatically deploys the file system for highly reliable and available operation.

“Lustre was originally architected to quickly serve massively large data sets, in the gigabyte to terabyte range, in a single file,” explained Gorda. “By working in parallel, the object storage servers feed up that data incredibly fast and efficiently. With those types of workloads, Lustre did not need high performance metadata servers.” Life Sciences and the Financial Services Industry (FSI) present a different situation, one for which Lustre was not originally designed.
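The parallel access Gorda describes rests on striping a single large file across many storage targets. The sketch below illustrates the general round-robin (RAID-0 style) idea in simplified form; it is not Lustre's actual code, and the class name, function names, and the 1 MiB stripe size and 4-target count used in the demo are assumptions chosen for illustration.

```python
# Simplified sketch of round-robin file striping across storage targets.
# Assumption: StripeLayout, locate(), and bytes_per_target() are hypothetical
# names for illustration, not part of any Lustre API.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class StripeLayout:
    stripe_size: int   # bytes written to one target before moving to the next
    stripe_count: int  # number of targets the file is spread across

    def locate(self, offset: int) -> Tuple[int, int]:
        """Map a file offset to (target index, offset inside that target's object)."""
        if offset < 0:
            raise ValueError("offset must be non-negative")
        chunk = offset // self.stripe_size
        target = chunk % self.stripe_count
        # Every full pass over all targets adds one chunk to each target's object.
        object_offset = (chunk // self.stripe_count) * self.stripe_size
        object_offset += offset % self.stripe_size
        return target, object_offset


def bytes_per_target(layout: StripeLayout, file_size: int) -> List[int]:
    """Return how many bytes of a file of file_size land on each target."""
    full_chunks, tail = divmod(file_size, layout.stripe_size)
    rounds, extra = divmod(full_chunks, layout.stripe_count)
    totals = [
        (rounds + (1 if t < extra else 0)) * layout.stripe_size
        for t in range(layout.stripe_count)
    ]
    if tail:
        # The final partial chunk goes to the next target in the rotation.
        totals[extra] += tail
    return totals


if __name__ == "__main__":
    MiB = 1 << 20
    layout = StripeLayout(stripe_size=MiB, stripe_count=4)
    print(layout.locate(5 * MiB + 10))
    print(bytes_per_target(layout, 10 * MiB))
```

Because consecutive chunks land on different targets, a large sequential read can be served by all targets at once, which is the property that makes this layout attractive for very large files and less helpful for workloads dominated by many small ones.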

In Life Sciences and FSI, the large data sets can be made up of a massive number of smaller files. For example, in genomics, sequencers can create millions of files of a few hundred megabytes that are joined together into a 6 TB data set describing a species’ genome. The file system must find these millions of files through metadata accesses in order to create the entire data set. Until recently, Lustre’s upper limit of about 60,000 file creates per second made it less efficient for these types of workloads. With parallel metadata servers added in the latest release, the file system has scaled out to over 1 million file creates per second, giving Life Sciences and FSI a significant performance benefit. According to Cowe, the Intel team and the open source Lustre community continue to develop methods for even greater scalability for these types of workloads. “We’re working on different striping methods, distributed transactions, and a project called Data on Metadata to enhance even further small file performance on Lustre,” remarked Cowe.
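The difference between the two create rates quoted above is easier to appreciate with a quick calculation. The sketch below uses the article's figures of roughly 60,000 and 1 million file creates per second; the 10 million file workload is a hypothetical number picked for illustration, and the calculation assumes each rate is sustained for the whole run.

```python
# Rough comparison of the two metadata create rates quoted in the article.
# Assumption: NUM_FILES is a hypothetical workload size, and both rates are
# treated as sustained averages with no other bottlenecks.

def seconds_to_create(num_files: int, creates_per_second: float) -> float:
    """Time needed to create num_files at a steady create rate."""
    if creates_per_second <= 0:
        raise ValueError("creates_per_second must be positive")
    return num_files / creates_per_second


NUM_FILES = 10_000_000    # hypothetical: ten million small files
EARLIER_RATE = 60_000     # quoted earlier upper limit, creates per second
PARALLEL_RATE = 1_000_000 # quoted rate with parallel metadata servers

earlier_seconds = seconds_to_create(NUM_FILES, EARLIER_RATE)
parallel_seconds = seconds_to_create(NUM_FILES, PARALLEL_RATE)
speedup = earlier_seconds / parallel_seconds

if __name__ == "__main__":
    print(f"At {EARLIER_RATE:,} creates/s:  {earlier_seconds:.1f} s")
    print(f"At {PARALLEL_RATE:,} creates/s: {parallel_seconds:.1f} s")
    print(f"Speedup: {speedup:.1f}x")
```

On these assumptions the same workload drops from a little under three minutes of pure metadata work to about ten seconds, a ratio of roughly 17 to 1.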

Enabling Lustre for Big Data and the Cloud
The performance Lustre delivers is ideal for Big Data workloads and the Cloud.

“Everybody associates Big Data with Hadoop,” claimed Gorda. “But there’s more to Big Data. We support Hadoop, but we go well beyond it.” This is good news for HPC users with a Lustre file system who want to try Hadoop without scaling out an entire Hadoop platform with replicated local storage. Intel created a Hadoop interface to HPC job schedulers, like Slurm, so the Hadoop job looks like an HPC job. Intel has also written a file system interface for Hadoop that replaces the Hadoop Distributed File System (HDFS) with Lustre, effectively removing the need for local storage and opening the door to running MapReduce with Lustre. PayPal uses Lustre and Big Data for real-time fraud detection. “Intel’s work with Lustre on Big Data is a huge enabling for our HPC customers who want to run Hadoop workloads on their data,” said Gorda.

HPC is emerging in the Cloud, so companies that temporarily need massively parallel computing capacity can take advantage of it. Intel has enabled these companies to use Lustre in the cloud with its release of Intel Cloud Edition for Lustre software. Amazon Web Services (AWS) uses the Intel cloud version to offer high-performance, scalable storage using Lustre in its Elastic Compute Cloud (EC2). AWS is able to deploy a production Lustre file system in 10 to 15 minutes, according to its website. SAS, the business analytics software company, delivers clustered analytics services through the Amazon Web Services Marketplace and recommends using Lustre on AWS for those services.

Teaming with the Community to Go Beyond Traditional Lustre Usages
With Intel’s and the community’s work, Lustre now also supports Hierarchical Storage Management for customers who need to balance their requirements for performance, scalability, and capacity. Intel and the community have also done considerable work to integrate other protocols, such as SMB (using Samba) and NFS, so that Lustre can be mixed with other networks.

Enterprises are showing strong interest in integrating more security functionality into Lustre. Intel and the Lustre community are developing access controls with SELinux to give applications fine-grained, secure access to data. “And we’re also looking at Kerberos to do authentication and authorization of nodes, plus over the wire network encryption,” indicated Cowe.

“The fact that the majority of sites we work with, and the majority of the community, have moved forward to a 2.5.x Lustre code base is a strong endorsement of the advances we’ve made over the past several years to add enterprise-class features and stability to the code,” said Gorda. “And there’s more to come.” Intel is working on the next generation of storage technology to support Exascale computing. “With upcoming Intel technologies and the software we’re working on, we’ll be able to support not terabytes, but petabytes per second,” stated Gorda.

“Today, we are seeing signs of encouragement in Lustre’s ability to provide value for enterprise environments,” said Gorda. “We believe this is the tip of the iceberg and definitely a sign of great things to come for the Lustre community.”

Ken Strandberg is a technical storyteller. He writes articles, white papers, seminars, web-based training, video and animation scripts, and technical marketing and interactive collateral for emerging technology companies, Fortune 100 enterprises, and multi-national corporations. Mr. Strandberg’s technology areas include Software, HPC, Industrial Technologies, Design Automation, Networking, Medical Technologies, Semiconductor, and Telecom.