Big data applications handle extremely large datasets that present challenges of scale. High-performance IT infrastructure is necessary to achieve very fast processing throughput for big data. Solid state drives (SSDs) based on NAND Flash memory are well-suited for big data applications because they provide ultra-fast storage performance, quickly delivering an impressive return on investment. SSDs can be deployed as host cache, network cache, all-SSD storage arrays, or hybrid storage arrays with an SSD tier.
Depending on the big data application, either enterprise-class or personal storage SSDs may be used. Enterprise SSDs are robust and durable, offering superior performance for mixed read/write workloads, while personal storage SSDs typically cost less and are suitable for read-centric workloads. It is important to understand the workload performance and endurance requirements before making a decision.
The Big Picture of Big Data
In a scenario where data grows 40% year-over-year, where 90% of the world's data was created in the last two years, and where terabytes (TBs) and petabytes (PBs) are talked about as glibly as megabytes and gigabytes, it is easy to mistakenly think that all data is big data. In fact, big data refers to datasets so large they are beyond the ability of traditional database management systems and data processing applications to capture and process. While the exact amount of data that qualifies as big is debatable, it generally ranges from tens of TBs to multiple PBs. The Gartner Group further characterizes it in terms of the volume of data, the velocity of data into and out of a system (e.g., real-time processing of a data stream), and the variety of data types and sources. Big data is typically unstructured, or free-floating, and not part of a relational database schema.
Examples of big data applications include:
- Business analytics to drive insight, innovation, and predictions
- Scientific computing, such as seismic processing, genomics, and meteorology
- Real-time processing of data streams, such as sensor data or financial transactions
- Web 2.0 public cloud services, such as social networking sites, search engines, video sharing, and hosted services
The primary reason for implementing big data solutions is productivity and competitive advantage. If analyzing customer data opens up new, high-growth market segments; or if analyzing product data leads to valuable new features and innovations; or if analyzing seismic images pinpoints the most productive places to drill for oil and gas—then big data is ultimately about big success.
Big data presents challenges of extreme scale. It pushes the limits of IT applications and infrastructure to process large datasets quickly and cost-effectively. Many technologies and techniques have been developed to meet these challenges, such as distributed computing, massively parallel processing (e.g., Apache Hadoop), and data structures that limit the data required for queries (e.g., bitmap indexes and column-oriented databases). Underlying all of this is a constant need for faster hardware with greater capacity. Fast processing throughput for big data demands faster, multicore CPUs; greater memory performance and capacity; improved network bandwidth; and higher storage capacity and throughput.
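The column-oriented idea mentioned above can be illustrated with a minimal sketch (hypothetical table and field names; real column stores add compression and encoding on top of this). A row-oriented layout must touch every field of every record, while a column-oriented layout lets a query such as SUM(spend) read only the one column it needs:

```python
# Hypothetical 1,000-row table with three fields per record.
rows = [
    {"user": f"u{i}", "region": "EU", "spend": float(i)}
    for i in range(1000)
]

# Row-oriented layout: scanning the table touches all fields of each record.
row_fields_scanned = sum(len(r) for r in rows)  # 3 fields x 1000 rows = 3000

# Column-oriented layout: one contiguous list per field.
columns = {
    "user":   [r["user"] for r in rows],
    "region": [r["region"] for r in rows],
    "spend":  [r["spend"] for r in rows],
}

# A query like SUM(spend) reads only the "spend" column.
total_spend = sum(columns["spend"])
col_fields_scanned = len(columns["spend"])  # 1 field x 1000 rows = 1000
```

The column layout scans a third of the fields here; in wide analytical tables with dozens of columns, the reduction in data read is proportionally larger.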
SSDs for Ultra-Fast Storage
SSDs have emerged as a popular choice for ultra-fast storage in enterprise environments, including big data applications. SSDs offer a level of price-to-performance somewhere between DRAM and hard disk drives (HDDs).
SSDs are an order of magnitude denser and less expensive than DRAM, but DRAM has higher bandwidth and significantly faster access times. Compared to HDDs, SSDs offer orders of magnitude faster random I/O performance and lower cost per IOPS, but HDDs still offer the best price per gigabyte. With capacity pricing for Flash memory projected to fall faster than other media, the SSD value proposition will continue to strengthen in the future.
SSDs provide several benefits for big data workloads:
- Exceptional Storage Performance – Deliver good sequential I/O and outstanding random I/O performance. In many systems, storage I/O is the bottleneck while powerful, multicore CPUs sit idle waiting for data to process. SSDs remove this bottleneck and unleash application performance, enabling true processing throughput and user productivity.
- Nonvolatile – Retain data when power is removed; unlike DRAM, no destaging is required.
- Low Power – Consume less power per system than equivalent spinning disks, reducing data center power and cooling expenses.
- Flexible Deployment – Available in a wider variety of form factors and interfaces than other storage media:
- Form factors: Half-height, half-length (HHHL), 2.5-inch, 1.8-inch, mSATA, M.2, etc.
- Interfaces: PCIe, SAS, and SATA
SSD Deployment Options
- Host Cache – SSDs reside in the host server and act as a level-2 cache for data moved out of memory. Intelligent caching software determines which blocks of data to hold in cache. PCIe SSDs are typically used because they offer the lowest latency, with no host bus adapters or storage controllers in the data path. Best results are achieved for read-heavy workloads. The cache may be read-only or write-back; redundant SSDs are recommended for write-back caching to ensure data is protected.
- Network Cache – Similar to host cache, except the SSDs reside in a shared network appliance that accelerates all storage systems behind it. Out-of-band cache is read-only, while in-band cache is write-back. Because it is shared, network cache offers better economics, but it can be slower than direct host cache.
- All-SSD Storage Array – An enterprise storage array that uses Flash for storage and DRAM for ultra-high throughput and low latency. All-SSD arrays offer features traditionally found in enterprise storage, such as built-in RAID, snapshots, and replication. They may include technologies like inline compression and deduplication to shrink the data footprint and maximize SSD efficiency. This option also provides centralized management of SSD wearout across the entire array.
- SSD Tier in a Hybrid Storage Array – A traditional enterprise storage array that includes SSDs as an ultra-fast tier in a hybrid storage environment. Automated storage management monitors data usage and places hot data in the SSD tier and cold, or less-frequently accessed, data in high-capacity, slower HDD tiers to optimize storage performance and cost. This option works well for mixed data, some of which requires very high performance. A variation on hybrid storage is incorporating an SSD as secondary cache in the storage controller's read/write cache.
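The automated tiering described above can be sketched as a simple promote/demote policy. This is a minimal illustration with assumed names and a hypothetical threshold, not any vendor's implementation: blocks whose recent access counts cross a threshold are promoted to the SSD tier, and blocks that go quiet are demoted back to HDD on a periodic sweep.

```python
from collections import Counter

PROMOTE_THRESHOLD = 3  # assumed tuning parameter

class TieringEngine:
    """Toy hot/cold tiering policy over logical block IDs."""

    def __init__(self):
        self.access_counts = Counter()  # accesses since the last sweep
        self.ssd_tier = set()           # hot blocks
        self.hdd_tier = set()           # cold blocks

    def add_block(self, block_id):
        # New data lands on the high-capacity HDD tier by default.
        self.hdd_tier.add(block_id)

    def on_access(self, block_id):
        self.access_counts[block_id] += 1
        if (block_id in self.hdd_tier
                and self.access_counts[block_id] >= PROMOTE_THRESHOLD):
            # Hot data moves up to the SSD tier.
            self.hdd_tier.discard(block_id)
            self.ssd_tier.add(block_id)

    def demote_cold(self):
        # Periodic sweep: SSD-tier blocks with no accesses since the
        # last sweep are considered cold and move back to HDD.
        for block_id in list(self.ssd_tier):
            if self.access_counts[block_id] == 0:
                self.ssd_tier.discard(block_id)
                self.hdd_tier.add(block_id)
        self.access_counts.clear()

engine = TieringEngine()
engine.add_block("a")
engine.add_block("b")
for _ in range(3):
    engine.on_access("a")  # "a" crosses the threshold and is promoted
```

Production tiering engines work on coarser extents, weight recency as well as frequency, and rate-limit migrations so data movement does not compete with foreground I/O, but the hot-up/cold-down decision loop is the same.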
© HPC Today 2020 - All rights reserved.