Original Link: https://www.anandtech.com/show/14070/the-memblaze-pblaze5-c916-ssd-review
The Memblaze PBlaze5 C916 Enterprise SSD Review: High Performance and High Capacities
by Billy Tallis on March 13, 2019 9:05 AM EST

The biggest players in the enterprise SSD market are familiar names to followers of the consumer SSD market: Samsung, Intel, Micron, and the other vertically-integrated NAND flash manufacturers lead both segments. Below that top tier, however, the consumer and enterprise markets have very little brand overlap, even though the second-tier business models are quite similar. In both markets, fabless SSD manufacturers build their businesses around buying NAND and SSD controllers from the larger suppliers and designing their own drives around those components. With hardware built from commodity parts, these fabless firms rely on custom firmware to differentiate their products.
A prime example of one of these fabless companies – and the subject of today's review – is Beijing-based Memblaze. The company has made a name for itself in the enterprise space over several generations of high-end NVMe SSDs, starting in 2015 with their PBlaze4. In 2017 they released the first round of PBlaze5 SSDs, which moved to Micron's first-generation 3D TLC NAND.
Most recently, late last year a new generation of PBlaze5 SSDs with Micron's 64-layer 3D TLC began to arrive. Today we're looking at the flagship from this latest generation, the PBlaze5 C916 6.4TB SSD. In addition to its large capacity, this drive features a PCIe 3.0 x8 interface that allows for speeds in excess of 4GB/s, and a high write endurance rating that makes it suitable for a broad range of workloads. The only way to get significantly higher performance or endurance from a single SSD is to switch to something that uses specialized low-latency memory like Intel's 3D XPoint, but their Optane SSDs only offer a fraction of the capacity.
Memblaze PBlaze5 AIC Series Comparison

|  | C916 | C910 | C900 | C700 |
|---|---|---|---|---|
| Controller | Microsemi Flashtec PM8607 NVMe2016 (all models) | | | |
| NAND Flash | Micron 64L 3D TLC | Micron 64L 3D TLC | Micron 32L 3D TLC | Micron 32L 3D TLC |
| Capacities | 3.2-6.4 TB | 3.84-7.68 TB | 2-8 TB | 2-11 TB |
| Endurance | 3 DWPD | 1 DWPD | 3 DWPD | 1 DWPD |
| Warranty | Five years (all models) | | | |
The PBlaze5 family of enterprise SSDs are all relatively high-end, but the product line has broadened to include quite a few different models. The models that start with C are PCIe add-in cards, with PCIe 3.0 x8 interfaces that allow for higher throughput than the PCIe x4 links that most NVMe SSDs are limited to. The models that start with D are U.2 drives that support operation as either a PCIe x4 device or dual-port x2+x2 for high availability configurations. Memblaze offers models in two endurance tiers: 1 or 3 drive writes per day, reflecting the trend away from 5+ DWPD models as capacities have grown and alternatives like 3D XPoint and Z-NAND have arrived to serve the most write-intensive workloads.
The add-in card models are more performance-focused, while the U.2 lineup includes both the highest capacities (currently 15.36 TB), as well as some models designed for lower power and capacities so that a thinner 7mm U.2 case can be used.
Most of the PBlaze5 family uses the Microsemi (formerly PMC-Sierra) Flashtec NVMe2016 controller, one of the most powerful SSD controllers currently on the market. The 16-channel NVMe2016 and the even larger 32-channel NVMe2032 face little competition from the usual suppliers of SSD controllers for the consumer market, though in the past year both Silicon Motion and Marvell have announced 16-channel controller solutions derived from the combination of two of their 8-channel controllers. Instead, the competition for the NVMe2016 comes from the largest in-house controllers developed by companies like Samsung, Intel and Toshiba, as well as Xilinx FPGAs that are used to implement custom controller architectures for other vendors. All of these controller solutions are strictly for the enterprise/datacenter market, and are unsuitable for consumer SSDs: the pin count necessary for 16 or more NAND channels makes these controllers too big to fit on M.2 cards, and they are too power-hungry for notebooks.
Micron's 64-layer 3D TLC NAND has consistently proven to offer higher performance than their first-generation 32L TLC, but Memblaze isn't advertising any big performance increases over the earlier PBlaze5 SSDs. Instead, they have brought the overprovisioning ratios back down to fairly normal levels after the 32L PBlaze5 drives. Those drives were rated for 3 DWPD, and as a result kept almost 40% of their raw flash capacity as spare area. The PBlaze5 C916 with 64L TLC, on the other hand, reserves only about 27% of the flash as spare and suffers only a slight penalty to steady-state write speeds, and no penalty to rated endurance. (For comparison, consumer SSDs generally reserve 7-12% of their raw capacity for metadata and spare area, and are usually rated for no more than about 1 DWPD.)
Our 6.4TB PBlaze5 C916 sample features a total of 8TiB of NAND flash in 32 packages each containing four 512Gb dies. This makes for a fairly full PCB, with 16 packages each on the front and back. There is also 9GB of DDR4 DRAM on board, providing the usual 1GB per TB, plus ECC protection for the DRAM.
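As a quick sanity check on those ratios, the spare area can be worked out directly from the raw and usable capacities quoted above. The short Python snippet below is just that arithmetic, using the 8 TiB raw / 6.4 TB usable figures for our review sample:

```python
# Overprovisioning check for the 6.4TB PBlaze5 C916, using the capacities
# quoted above: 8 TiB of raw NAND (32 packages x four 512Gb dies) exposed
# as 6.4 TB (decimal) of usable space.
raw_flash = 8 * 2**40          # 8 TiB in bytes
usable = 6.4e12                # 6.4 TB in bytes

spare_fraction = 1 - usable / raw_flash
print(f"Spare area: {spare_fraction:.1%}")          # ~27.2%

# The ~40% spare area of the 32L-generation drives would have left only
# about 5.3 TB usable from the same amount of flash:
print(f"Usable at 40% OP: {raw_flash * 0.60 / 1e12:.1f} TB")   # ~5.3 TB
```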
Memblaze PBlaze5 C916 Series Specifications

|  | PBlaze5 C916 | | PBlaze5 C900 | | | |
|---|---|---|---|---|---|---|
| Capacity | 3.2 TB | 6.4 TB | 2 TB | 3.2 TB | 4 TB | 8 TB |
| Form Factor | HHHL AIC | | HHHL AIC | | | |
| Interface | PCIe 3.0 x8 | | PCIe 3.0 x8 | | | |
| Controller | Microsemi Flashtec PM8607 NVMe2016 | | | | | |
| Protocol | NVMe 1.2a | | | | | |
| DRAM | Micron DDR4-2400 | | | | | |
| NAND Flash | Micron 512Gb 64L 3D TLC | | Micron 384Gb 32L 3D TLC | | | |
| Sequential Read (GB/s) | 5.5 | 5.9 | 5.3 | 6.0 | 5.9 | 5.5 |
| Sequential Write (GB/s) | 3.1 | 3.8 | 2.2 | 3.2 | 3.8 | 3.8 |
| Random Read (4 kB) IOPS | 850k | 1000k | 823k | 1005k | 1010k | 1001k |
| Random Write (4 kB) IOPS | 210k | 303k | 235k | 288k | 335k | 348k |
| Read Latency (4 kB) | 87 µs | | 93 µs | | | |
| Write Latency (4 kB) | 11 µs | | 15 µs | | | |
| Idle Power | 7 W | | | | | |
| Operating Power | 25 W | | | | | |
| Endurance | 3 DWPD | | | | | |
| Warranty | Five years | | | | | |
Diving into the performance specs for the PBlaze5 C916 compared to its immediate predecessor, we see that the 6.4TB C916 should mostly match the fastest 4TB C900 model, but steady-state random write performance is rated to be about 10% slower. The smaller 3.2TB C916 shows more significant performance drops compared to the 3.2TB C900, but in terms of cost it is better viewed as a replacement for the old 2TB model. Random read and write latencies are rated to be a few microseconds faster on the C916 with 64L TLC than the C900 with 32L TLC.
The C916 is rated for the same 7W idle and 25W maximum power draw as the earlier PBlaze5 SSD. However, Memblaze has made a few changes to the power management features. The 900 series included power states to limit the drive to 20W or 15W, but the 916 can be throttled all the way down to 10W and provides a total of 16 power states, allowing the limit to be tuned in 1W increments between 10W and 25W. We've never encountered an NVMe SSD with this many power states before, and it seems a bit excessive.
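For readers who want to experiment with this, NVMe exposes such limits through its standard Power Management feature (Feature ID 0x02), which should be settable with ordinary tools like nvme-cli. The sketch below assumes, purely for illustration, that power state 0 is the 25W default and that each successive state drops the limit by 1W; the drive's actual index-to-wattage mapping should be read from its identify data rather than assumed.

```python
import subprocess

def power_state_for_limit(watts: int) -> int:
    """Map a desired power ceiling to a power state index, assuming
    (hypothetically) that state 0 = 25 W and each step drops 1 W to 10 W."""
    if not 10 <= watts <= 25:
        raise ValueError("the C916's limit is tunable between 10 W and 25 W")
    return 25 - watts

def set_power_limit(dev: str, watts: int) -> None:
    ps = power_state_for_limit(watts)
    # Feature ID 0x02 is the standard NVMe Power Management feature.
    subprocess.run(["nvme", "set-feature", dev, "-f", "0x02", "-v", str(ps)],
                   check=True)

# e.g. throttle the drive to the 10 W state used later in this review:
# set_power_limit("/dev/nvme0", 10)
```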
Ultimately the lower power states don't make much sense for the C916 because most PCIe x8 slots have no trouble delivering 25W and enough airflow to cool the drive. However, the D916 in the U.2 form factor is harder to cool, and the configurable power limit may come in handy for some systems. So for this review, the C916 was run through the test suite twice: once with the default 25W power state, and once in the lowest 10W limit state to see what workloads are affected and how the drive's QoS holds up during throttling.
In addition to its flexible power management, the PBlaze5 supports several of the more advanced NVMe features that are often left out on entry-level enterprise drives. The drive supports 128 NVMe queues, so all but the largest servers will be able to assign one queue to each CPU core, allowing IO to be performed without core to core locking or synchronization. Many older enterprise NVMe SSDs we have tested are limited to 32 queues, which is less than optimal for our 36-core testbed. To complement the dual-port capability of the U.2 version, the firmware supports multipath IO, multiple namespaces, and reservations to coordinate access to namespaces between different hosts connected to the same PCIe fabric. The PBlaze5 C916 does not yet support features from NVMe 1.3 or the upcoming 1.4 specification.
The Competition
We don't have a very large collection of enterprise SSDs, but we have a handful of other recent high-end datacenter drives to compare the PBlaze5 C916 against. Most of these drives were included in our recent roundup of enterprise SSDs. The PBlaze5 C900 is the immediate predecessor to the C916, and the D900 is the U.2 version. The Micron 9100 MAX is an older drive that uses the same Microsemi controller but planar MLC NAND, so it represents the high-end from two generations back.
From Intel we have the top of the line Optane DC P4800X, and the TLC-based P4510 8TB. The P4610 would be a closer match for the C916 as both are rated for 3 DWPD, while the P4510 is better suited for comparison against the PBlaze5 C910 in the 1 DWPD segment. However, the P4510 is still based on the same 64L IMFT TLC that the PBlaze5 C916 uses, so aside from steady state write speeds the performance differences should be mostly due to the controller differences.
The two Samsung drives are both based around the 8-channel Phoenix controller that is also used in their consumer NVMe product line. The 983 DCT occupies a decidedly lower market segment than the Memblaze drives, but the 983 ZET is a high-end product with Samsung's specialized low-latency Z-NAND flash memory. Samsung's PM1725b is their current closest competitor to the PBlaze5 C916, with a PCIe x8 interface and 3 DWPD rating. However, there's no retail version of the PM1725b so samples are harder to come by.
Test System
Intel provided our enterprise SSD test system, one of their 2U servers based on the Xeon Scalable platform (codenamed Purley). The system includes two Xeon Gold 6154 18-core Skylake-SP processors, and 16GB DDR4-2666 DIMMs on all twelve memory channels for a total of 192GB of DRAM. Each of the two processors provides 48 PCI Express lanes plus a four-lane DMI link. The allocation of these lanes is complicated. Most of the PCIe lanes from CPU1 are dedicated to specific purposes: the x4 DMI plus another x16 link go to the C624 chipset, and there's an x8 link to a connector for an optional SAS controller. This leaves CPU2 providing the PCIe lanes for most of the expansion slots, including most of the U.2 ports.
| Enterprise SSD Test System | |
|---|---|
| System Model | Intel Server R2208WFTZS |
| CPU | 2x Intel Xeon Gold 6154 (18 cores, 3.0 GHz) |
| Motherboard | Intel S2600WFT |
| Chipset | Intel C624 |
| Memory | 192 GB total, Micron DDR4-2666 16 GB modules |
| Software | Linux kernel 4.19.8, fio version 3.12 |

Thanks to StarTech for providing a RK2236BKF 22U rack cabinet.
The enterprise SSD test system and most of our consumer SSD test equipment are housed in a StarTech RK2236BKF 22U fully-enclosed rack cabinet. During testing for this review, the front door on this rack was generally left open to allow better airflow, since the rack doesn't include exhaust fans of its own. The rack is currently installed in an unheated attic with ambient temperatures that provide a reasonable approximation of a well-cooled datacenter.
The test system is running a Linux kernel from the most recent long-term support branch. This brings in about a year's work on Meltdown/Spectre mitigations, though strategies for dealing with Spectre-style attacks are still evolving. The benchmarks in this review are all synthetic benchmarks, with most of the IO workloads generated using FIO. Server workloads are too widely varied for it to be practical to implement a comprehensive suite of application-level benchmarks, so we instead try to analyze performance on a broad variety of IO patterns.
Enterprise SSDs are specified for steady-state performance and don't include features like SLC caching, so the duration of benchmark runs doesn't have much effect on the score, so long as the drive was thoroughly preconditioned. Except where otherwise specified, for our tests that include random writes, the drives were prepared with at least two full drive writes of 4kB random writes. For all the other tests, the drives were prepared with at least two full sequential write passes.
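As an illustration, preconditioning along these lines could be scripted with fio roughly as shown below; the device path, queue depth, and sequential block size are our own illustrative choices, not Memblaze's recommendations or the exact settings of our test suite.

```python
import subprocess

DEV = "/dev/nvme0n1"   # illustrative device path

def fio(*args: str) -> None:
    # Run fio directly against the raw block device with buffering disabled.
    subprocess.run(["fio", f"--filename={DEV}", "--direct=1",
                    "--ioengine=libaio", *args], check=True)

# Two full-drive passes of 4kB random writes ahead of the random write tests:
fio("--name=precondition-rand", "--rw=randwrite", "--bs=4k",
    "--iodepth=32", "--loops=2")

# Two full sequential write passes ahead of everything else:
fio("--name=precondition-seq", "--rw=write", "--bs=128k",
    "--iodepth=32", "--loops=2")
```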
Our drive power measurements are conducted with a Quarch XLC Programmable Power Module. This device supplies power to drives and logs both current and voltage simultaneously. With a 250kHz sample rate and precision down to a few mV and mA, it provides a very high resolution view into drive power consumption. For most of our automated benchmarks, we are only interested in averages over time spans on the order of at least a minute, so we configure the power module to average together its measurements and only provide about eight samples per second, but internally it is still measuring at 4µs intervals so it doesn't miss out on short-term power spikes.
QD1 Random Read Performance
Drive throughput with a queue depth of one is usually not advertised, but almost every latency or consistency metric reported on a spec sheet is measured at QD1 and usually for 4kB transfers. When the drive only has one command to work on at a time, there's nothing to get in the way of it offering its best-case access latency. Performance at such light loads is absolutely not what most of these drives are made for, but they have to make it through the easy tests before we move on to the more realistic challenges.
The PBlaze5 C916 is slightly faster for random reads at QD1 than the predecessor with 32L TLC, but Intel's P4510 has even lower latency from the same 64L TLC NAND.
Power Efficiency in kIOPS/W | Average Power in W |
The C916 uses a bit less power than its predecessor, making it more efficient—but all three drives with the 16-channel Microsemi controller are still very power hungry compared to the smaller controllers in the Intel and Samsung drives. The advantages of such a large controller are wasted on simple QD1 workloads.
The PBlaze5 C916 brings a slight improvement to average read latency, but the more substantial change is in the tail latencies. The 99.99th percentile read latency is now much better than the earlier PBlaze5 drives and the Intel P4510.
The new PBlaze5 is slightly faster for small-block random reads, but the large-block random read throughput has actually decreased relative to the previous generation drive. The PBlaze5 C916 offers peak IOPS for 4kB or smaller reads, without the sub-4k IOPS penalty we sometimes see on drives like the Samsung 983 ZET.
QD1 Random Write Performance
The queue depth 1 random write performance of the PBlaze5 C916 is excellent and a big improvement over the previous generation PBlaze5. At queue depth 1, the C916 is providing about the same random write throughput that SATA SSDs top out at with high queue depths. Even Samsung's Z-NAND based 983 ZET is about 10% slower at QD1.
Power Efficiency in kIOPS/W | Average Power in W |
The power efficiency of the C916 during QD1 random writes does not stand out the way the raw performance does, but it is competitive with the drives that feature lower-power controllers, and is far better than the older drives we've tested with this Microsemi controller.
The new PBlaze5 C916 is in the lead for average and 99th percentile random write latency, but the 99.99th percentile latencies have regressed significantly, almost to the level of the Intel P4510. The earlier PBlaze5's 99.99th percentile latency was much more in line with other drives we've tested.
The PBlaze5 C916 continues the trend of very poor random write performance for sub-4kB block sizes on all drives using the Microsemi Flashtec controller. Since there's no performance advantage for small block random reads either, these drives should at the very least be shipping configured for 4k sectors out of the box instead of defaulting to 512-byte sectors, and dropping support for 512B sectors entirely would be reasonable.
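For what it's worth, switching an NVMe drive to a 4K LBA format is a quick (and destructive) operation with nvme-cli; which LBA format index corresponds to 4K sectors varies from drive to drive, so it has to be looked up first. A rough sketch, with an assumed device path:

```python
import subprocess

DEV = "/dev/nvme0n1"   # assumed device path; reformatting erases the drive

# List the namespace's supported LBA formats to find the 4K entry
# (the index differs between drives):
subprocess.run(["nvme", "id-ns", DEV], check=True)

# Then reformat with the chosen LBA format index, e.g. index 1 here:
# subprocess.run(["nvme", "format", DEV, "--lbaf=1"], check=True)
```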
As the test progresses to random write block sizes beyond 8kB, the 10W power limit starts to have an effect, ultimately limiting the drive to just under 500MB/s, less than half the throughput the C916 manages without the power limit.
QD1 Sequential Read Performance
The queue depth 1 sequential read performance of the new PBlaze5 is barely improved over the original. The PBlaze5 drives and the old Micron 9100 that uses the same Microsemi controller stand out for having exceptionally poor QD1 sequential read performance; they appear to not be doing any of the prefetching that allows competing drives to be at least three times faster at QD1.
Power Efficiency in MB/s/W | Average Power in W |
The power efficiency of the new PBlaze5 is a slight improvement over its predecessor, but given the poor performance its efficiency score is still far below the competition. In absolute terms, the total power consumption of the PBlaze5 C916 during this test is similar to the competing drives.
The QD1 sequential read throughput from the PBlaze5 C916 is quite low until the block sizes are very large: it doesn't break 1GB/s until the block size has reached 256kB, whereas the Intel P4510 and Samsung 983 DCT can provide that throughput with 16kB transfers. It appears likely that the PBlaze5 and the other Microsemi-based drives would continue improving in performance if this test continued to block sizes beyond 1MB, while the competing drives from Intel and Samsung have reached their limits with block sizes around 128kB.
QD1 Sequential Write Performance
At queue depth 1, many of the drives designed for more read-heavy workloads are already being driven to their steady-state write throughput limit, but the PBlaze5 drives that target mixed workloads with a 3 DWPD endurance rating aren't quite there. The newer PBlaze5 C916 is a bit slower than its predecessor that has much higher overprovisioning, but still easily outperforms the lower-endurance Intel and Samsung drives even when the 10W power limit is applied to the C916.
Power Efficiency in MB/s/W | Average Power in W |
The older PBlaze5 C900's power consumption for sequential writes was very high even at QD1, but the C916 cuts this by 25% for a nice boost to efficiency. Applying the 10W limit to the C916 brings down the power consumption much more without having as big an impact on performance, so the efficiency score climbs significantly and surpasses the rest of the flash-based SSDs in this review.
As with random writes, the PBlaze5 C916 handles sequential writes with tiny block sizes so poorly that sub-4kB transfers probably shouldn't even be accepted by the drive. For 4kB through 16kB block sizes, the newer PBlaze5 is a bit faster than its predecessor. For the larger block sizes that are more commonly associated with sequential IO, the C916 starts to fall behind the C900, and the 10W power limit begins to have an impact.
Peak Random Read Performance
For client/consumer SSDs we primarily focus on low queue depth performance for its relevance to interactive workloads. Server workloads are often intense enough to keep a pile of drives busy, so the maximum attainable throughput of enterprise SSDs is actually important. But it usually isn't a good idea to focus solely on throughput while ignoring latency, because somewhere down the line there's always an end user waiting for the server to respond.
In order to characterize the maximum throughput an SSD can reach, we need to test at a range of queue depths. Different drives will reach their full speed at different queue depths, and increasing the queue depth beyond that saturation point may be slightly detrimental to throughput, and will drastically and unnecessarily increase latency. Because of that, we are not going to compare drives at a single fixed queue depth. Instead, each drive was tested at a range of queue depths up to the excessively high QD 512. For each drive, the queue depth with the highest performance was identified. Rather than report that value, we're reporting the throughput, latency, and power efficiency for the lowest queue depth that provides at least 95% of the highest obtainable performance. This often yields much more reasonable latency numbers, and is representative of how a reasonable operating system's IO scheduler should behave. (Our tests have to be run with any such scheduler disabled, or we would not get the queue depths we ask for.)
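In pseudocode terms, the reported operating point is picked from the queue depth sweep roughly as follows; the numbers in the example dictionary are made up purely for illustration.

```python
def reported_qd(results: dict[int, float], threshold: float = 0.95) -> int:
    """Lowest queue depth achieving at least 95% of the best throughput
    observed anywhere in the sweep."""
    peak = max(results.values())
    return min(qd for qd, iops in results.items() if iops >= threshold * peak)

# Hypothetical sweep results (queue depth -> IOPS):
sweep = {1: 110e3, 2: 210e3, 4: 390e3, 8: 610e3, 16: 640e3,
         32: 650e3, 64: 648e3, 128: 645e3, 256: 640e3, 512: 630e3}
print(reported_qd(sweep))   # -> 16; throughput, latency and power are
                            #    reported at this queue depth
```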
One extra complication is the choice of how to generate a specified queue depth with software. A single thread can issue multiple I/O requests using asynchronous APIs, but this runs into several problems: if each system call issues one read or write command, then context switch overhead becomes the bottleneck long before a high-end NVMe SSD's abilities are fully taxed. Alternatively, if many operations are batched together for each system call, then the real queue depth will vary significantly and it is harder to get an accurate picture of drive latency. Finally, the current Linux asynchronous IO APIs only work in a narrow range of scenarios. There is work underway to provide a new general-purpose async IO interface that will enable drastically lower overhead, but until that work lands in stable kernel versions, we're sticking with testing through the synchronous IO system calls that almost all Linux software uses. This means that we test at higher queue depths by using multiple threads, each issuing one read or write request at a time.
Using multiple threads to perform IO gets around the limits of single-core software overhead, and brings an extra advantage for NVMe SSDs: the use of multiple queues per drive. The NVMe drives in this review all support at least 32 separate IO queues, so we can have 32 threads on separate cores independently issuing IO without any need for synchronization or locking between threads.
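A minimal fio invocation in this style would look something like the sketch below, with 32 threads each issuing one synchronous 4kB random read at a time; the device path and runtime are illustrative rather than our exact test parameters.

```python
import subprocess

# "QD32" generated with 32 threads doing synchronous IO, one request each,
# instead of one thread driving an asynchronous queue.
subprocess.run([
    "fio", "--name=qd32-randread", "--filename=/dev/nvme0n1",
    "--rw=randread", "--bs=4k", "--direct=1",
    "--ioengine=psync",          # synchronous reads, one per thread
    "--numjobs=32", "--thread",  # 32 worker threads -> effective QD of 32
    "--group_reporting",
    "--runtime=60", "--time_based",
], check=True)
```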
The Memblaze PBlaze5 C916 performs well enough for random reads that it is largely CPU-limited with this test configuration. Reaching the rated 1M IOPS would require applications to use asynchronous IO APIs so that each thread can issue multiple outstanding random read requests at a time, drastically reducing the software overhead of system calls and context switches. That kind of rearchitecting is something that few application developers bother with given the current limitations of asynchronous IO on Linux, so the CPU-limited numbers here are a realistic upper bound for most use cases.
Despite the software overhead, the newer PBlaze5 is able to offer a marginal performance improvement over its predecessor and other TLC-based drives, but it doesn't come close to matching the Samsung Z-SSD. The 10W power limit has a small impact on performance on this test, but even with the limit in place the PBlaze5 C916 is still outperforming the Intel P4510 and Samsung 983 DCT.
Power Efficiency in kIOPS/W | Average Power in W |
The upgrade to 64L TLC allows the newer PBlaze5 to save several Watts during the random read test, bringing its efficiency score almost up to the level of the Intel P4510. The Samsung drives with the relatively low-power Phoenix 8-channel controller still have the best performance per Watt on this test, with both the TLC-based 983 DCT and Z-NAND based 983 ZET significantly outscoring all the other flash-based drives and even beating the Intel Optane SSD.
In its default power state, the PBlaze5 C916 still suffers from the abysmal 99.99th percentile latency seen with older drives based on the Microsemi controller. However, when the 10W power limit is applied, the drive's performance saturates just before the CPU speed runs out, and the score reported here reflects a slightly lower thread count. That causes the tail latency problem to disappear entirely and leaves the PBlaze5 C916 with better throughput and latency scores than the Intel and Samsung TLC drives.
Peak Sequential Read Performance
Since this test consists of many threads each performing IO sequentially but without coordination between threads, there's more work for the SSD controller and less opportunity for pre-fetching than there would be with a single thread reading sequentially across the whole drive. The workload as tested bears a closer resemblance to a file server streaming data to several simultaneous users than to the creation of a full-disk backup image.
The peak sequential read performance of the Memblaze PBlaze5 C916 is essentially unchanged from that of its predecessor. It doesn't hit the rated 5.9GB/s because this test uses multiple threads each performing sequential reads at QD1, rather than a single thread reading with a high queue depth. Even so, the C916 makes some use of the extra bandwidth afforded by its PCIe x8 interface. When limited to just 10W, the C916 ends up slightly slower than several of the drives with PCIe x4 interfaces.
Power Efficiency in MB/s/W | Average Power in W |
The drives with the Microsemi 16-channel controller are unsurprisingly more power hungry than the Intel and Samsung drives with smaller controllers, but the newest PBlaze5 uses less power than the older drives without sacrificing performance. The Samsung drives offer the best efficiency on this test, but the PBlaze5 C916 is competitive with the Intel P4510 and its performance per Watt is only about 18% lower than the Samsung 983 DCT.
Steady-State Random Write Performance
The hardest task for most enterprise SSDs is to cope with an unending stream of writes. Once all the spare area granted by the high overprovisioning ratios has been used up, the drive has to perform garbage collection while simultaneously continuing to service new write requests, and all while maintaining consistent performance. The next two tests show how the drives hold up after hours of non-stop writes to an already full drive.
The steady-state random write performance of the PBlaze5 C916 is slightly lower than the earlier C900; the inherently higher performance of Micron's 64L TLC over 32L TLC is not quite enough to offset the impact of the newer drive having less spare area. Putting the drive into its 10W limit power state severely curtails random write throughput, though it still manages to outperform the two Samsung drives that generally stay well below 10W even in their highest power state.
Power Efficiency in kIOPS/W | Average Power in W |
The newer PBlaze5 C916 uses several Watts less power during the random write test than its predecessor, so it manages a higher efficiency score despite slightly lower performance—and actually turns in the highest efficiency score among all the flash-based SSDs in this bunch. The 10W limit drops power consumption by 36% but cuts performance by 68%, so under those conditions the C916 provides only half the performance per Watt.
99.99th percentile latency is a problem for the PBlaze5 C916, and when performing random writes without a power limit, the average and 99th percentile latency scores are rather high as well. Saturating the full-power C916 with random writes requires more threads than our testbed has CPU cores, so some increase in latency is expected. The poor 99.99th percentile latency when operating with the 10W limit is entirely the drive's fault, and is a sign that Memblaze will have to work on improving the QoS of their lower power states if their customers actually rely on this throttling capability.
Steady-State Sequential Write Performance
The lower overprovisioning on the newer PBlaze5 takes a serious toll on steady-state sequential write performance, even though the newer 64L TLC is faster than the 32L TLC used by the first-generation PBlaze5. Even so, it remains far faster than the Intel and Samsung drives, which are designed for roughly 1 DWPD and have even less spare area than the 3 DWPD PBlaze5 C916. The Samsung 983 ZET has plenty of write endurance thanks to its SLC Z-NAND, but without lots of spare area the slow block erase process bottlenecks its write speed just as badly as it does on the TLC drives.
Power Efficiency in MB/s/W | Average Power in W |
The PBlaze5 C916 uses much less power on the sequential write test than the earlier PBlaze5, but due to the lower performance there isn't a significant improvement in efficiency. The 10W power limit actually helps efficiency a bit, because it yields a slightly larger power reduction than performance reduction.
Mixed Random Performance
Real-world storage workloads usually aren't pure reads or writes but a mix of both. It is completely impractical to test and graph the full range of possible mixed I/O workloads—varying the proportion of reads vs writes, sequential vs random and differing block sizes leads to far too many configurations. Instead, we're going to focus on just a few scenarios that are most commonly referred to by vendors, when they provide a mixed I/O performance specification at all. We tested a range of 4kB random read/write mixes at queue depths of 32 and 128. This gives us a good picture of the maximum throughput these drives can sustain for mixed random I/O, but in many cases the queue depth will be far higher than necessary, so we can't draw meaningful conclusions about latency from this test. As with our tests of pure random reads or writes, we are using 32 (or 128) threads each issuing one read or write request at a time. This spreads the work over many CPU cores, and for NVMe drives it also spreads the I/O across the drive's several queues.
The full range of read/write mixes is graphed below, but we'll primarily focus on the 70% read, 30% write case that is a fairly common stand-in for moderately read-heavy mixed workloads.
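A sketch of how such a sweep might be driven with fio is below, stepping the read percentage from 100% down to 0% at an effective queue depth of 32; the device path, step size, and runtime are assumptions for illustration.

```python
import subprocess

# Sweep 4kB random read/write mixes at an effective QD of 32
# (32 threads, one synchronous IO apiece).
for read_pct in range(100, -1, -10):
    subprocess.run([
        "fio", f"--name=mix-{read_pct}r", "--filename=/dev/nvme0n1",
        "--rw=randrw", f"--rwmixread={read_pct}",
        "--bs=4k", "--direct=1", "--ioengine=psync",
        "--numjobs=32", "--thread", "--group_reporting",
        "--runtime=60", "--time_based",
    ], check=True)
```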
Queue Depth 32 | Queue Depth 128 |
At the lower queue depth of 32, the PBlaze5 drives have a modest performance advantage over the other flash-based SSDs, and the latest PBlaze5 C916 is the fastest. At the higher queue depth, the PBlaze5 SSDs in general pull way ahead of the other flash-based drives but the C916 no longer has a clear lead over the older models. The Intel P4510's performance increases slightly with the larger queue depth but the Samsung drives are already saturated at QD32.
QD32 Power Efficiency in MB/s/W | QD32 Average Power in W | ||||||||
QD128 Power Efficiency in MB/s/W | QD128 Average Power in W |
As usual the latest PBlaze5 uses less power than its predecessors even before the 10W limit is applied, but on this test that doesn't translate to a clear win in overall efficiency. The Intel Optane SSD is the only one that really stands out with great power efficiency on this test, and compared to that the TLC drives all score fairly close to each other for efficiency, especially at the lower queue depth.
QD32 | QD128
The 10W limit has a significant impact on the PBlaze5 C916 through almost all portions of the mixed I/O tests. With or without the power limit, the C916 performs lower than expected at the pure read end of the test, but follows a more normal performance curve through the rest of the IO mixes. The decline in performance as more writes are added to the mix is relatively shallow, especially in the read-heavy half of the test. The older PBlaze5 drives with more extreme overprovisioning hold up a bit better on the write-heavy half of the test than the C916.
Aerospike Certification Tool
Aerospike is a high-performance NoSQL database designed for use with solid state storage. The developers of Aerospike provide the Aerospike Certification Tool (ACT), a benchmark that emulates the typical storage workload generated by the Aerospike database. This workload consists of a mix of large-block 128kB reads and writes, and small 1.5kB reads. When the ACT was initially released back in the early days of SATA SSDs, the baseline workload was defined to consist of 2000 reads per second and 1000 writes per second. A drive is considered to pass the test if it meets the following latency criteria:
- fewer than 5% of transactions exceed 1ms
- fewer than 1% of transactions exceed 8ms
- fewer than 0.1% of transactions exceed 64ms
Drives can be scored based on the highest throughput they can sustain while satisfying the latency QoS requirements. Scores are normalized relative to the baseline 1x workload, so a score of 50 indicates 100,000 reads per second and 50,000 writes per second. Since this test uses fixed IO rates, the queue depths experienced by each drive will depend on their latency, and can fluctuate during the test run if the drive slows down temporarily for a garbage collection cycle. The test will give up early if it detects the queue depths growing excessively, or if the large block IO threads can't keep up with the random reads.
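Concretely, the score is just a multiplier on that 1x baseline, and the pass/fail decision applies the three latency limits listed above. A short worked example, with made-up latency percentages:

```python
# An ACT score is a multiplier on the 1x baseline of 2,000 reads/s and
# 1,000 writes/s, so a score of 50 corresponds to:
score = 50
reads_per_sec = score * 2000    # 100,000 reads per second
writes_per_sec = score * 1000   #  50,000 writes per second

def act_pass(pct_over: dict[float, float]) -> bool:
    """pct_over maps a latency threshold in ms to the percentage of
    transactions that exceeded it during the run."""
    return pct_over[1] < 5.0 and pct_over[8] < 1.0 and pct_over[64] < 0.1

print(act_pass({1: 2.3, 8: 0.2, 64: 0.0}))   # True: within all three limits
```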
We used the default settings for queue and thread counts and did not manually constrain the benchmark to a single NUMA node, so this test produced a total of 64 threads scheduled across all 72 virtual (36 physical) cores.
The usual runtime for ACT is 24 hours, which makes determining a drive's throughput limit a long process. For fast NVMe SSDs, this is far longer than necessary for drives to reach steady-state. In order to find the maximum rate at which a drive can pass the test, we start at an unsustainably high rate (at least 150x) and incrementally reduce the rate until the test can run for a full hour, then decrease the rate further if necessary to get the drive under the latency limits.
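The search itself amounts to the simple loop sketched below; run_act() here is a stand-in for launching a one-hour ACT run at a given multiplier, and its pass thresholds are invented purely so the example produces output.

```python
def run_act(score: int) -> tuple[bool, bool]:
    """Stand-in for a one-hour ACT run at `score` times the baseline rate.
    Returns (completed_without_falling_behind, met_latency_qos)."""
    return (score <= 95, score <= 85)   # invented pass points for illustration

def find_max_act_score(start: int = 150, step: int = 5) -> int:
    """Start well above what any drive can sustain and back off until a
    full run completes within the latency limits."""
    score = start
    while score > 0:
        completed, met_qos = run_act(score)
        if completed and met_qos:
            return score
        score -= step
    return 0

print(find_max_act_score())   # -> 85 with the stand-in above
```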
The performance of the PBlaze5 C916 on the Aerospike test is a bit lower than the older C900 delivered, but is still well above what the lower-endurance SSDs can sustain. Even with a 10W limit, the C916 is still able to sustain higher throughput than the Intel P4510.
Power Efficiency | Average Power in W |
The power consumption of the C916 is lower than the C900, but the efficiency score isn't improved because the performance drop roughly matched the power savings. The C916 is still more efficient than the competing drives on this test when its power consumption is unconstrained, but with the 10W limit its efficiency advantage is mostly eliminated.
Conclusion
When Memblaze updated their PBlaze5 SSDs with newer 64-layer 3D TLC NAND, they could have left everything else more or less the same and the result would likely have been a new generation of drives with improved performance and power efficiency across the board. Instead, Memblaze decided to rebalance the product line a bit, using the improved performance of Micron's second-generation 3D NAND as an opportunity to rein in the rather high overprovisioning ratios that the PBlaze5 initially used. The refreshed PBlaze5 models offer more usable capacity for the same amount of raw flash memory on the drive, without sacrificing much performance. The difference really adds up for high-capacity drives: with the OP ratio used on the first generation, our 6.4TB PBlaze5 C916 would have instead had a usable capacity of only 5.3 TB.
The PBlaze5 C916 also retains the same 3 drive writes per day (DWPD) endurance rating as the older PBlaze5 C900, which puts these drives in one of the highest endurance tiers that still uses mainstream high-capacity TLC NAND flash memory.
Cutting down on spare area reserved for the drive's internal use usually has a big impact on steady-state write speeds. For the PBlaze5 this impact is reduced by the switch to faster flash memory, but the newer C916 still loses some write performance in most of our tests. Even when it does not match the performance of its predecessor, the PBlaze5 C916 clearly offers a higher class of write performance than competing drives in the 1 DWPD market segment.
The switch to newer 3D NAND flash memory allows the C916 to use much less power than the C900, which helps offset the relatively high baseline power consumption of the massive 16-channel SSD controller Memblaze uses. The Intel and Samsung drives we compared against use smaller controllers that give them an advantage in power consumption, but now that Memblaze is using similar NAND, the PBlaze5 can come out ahead in power efficiency whenever the workload is heavy enough to make use of the higher NAND channel count and wider PCIe interface.
On top of the more efficient NAND, the new PBlaze5 comes with richer power management capabilities than we have encountered in any other datacenter SSD, with a power limit that can be adjusted from the default 25W down to 10W in 1W increments. Our tests of the PBlaze5 C916 in its 10W power state brought it down to similar peak power levels as the competing drives with smaller controllers. This throttling didn't affect every test; at low queue depths and for very read-oriented workloads the C916 was already comfortably below 10W. Write speed is severely constrained by the reduced power limit, but even at 10W the C916 still generally offers better steady-state write performance than the drives in the 1 DWPD market segment.
The adjustable power limit doesn't make a lot of sense for a big add-in card SSD like the PBlaze5 C916, but may be a useful feature for the U.2 versions. It seems like Memblaze has done a good job of implementing this capability without undue sacrifice to overall performance or QoS. High-density deployments of the U.2 versions that may not be able to offer enough airflow to manage 15-20W per drive can still benefit from most of the performance offered by the PBlaze5.
The biggest improvement the new PBlaze5 brings over its predecessors is one we are unfortunately not in a position to accurately quantify. The PBlaze5 C916 with 64L TLC costs much less than the PBlaze5 C900 with 32L NAND. The higher density NAND is cheaper to produce, prices have crashed over the past year due to excellent supply conditions across the market, and the newer PBlaze5 generation has higher usable capacities for the same raw capacity of flash. All told, this means the volume price of our 6.4TB C916 is probably significantly lower than what our 4TB C900 was going for; but of course, those prices are almost never made public for enterprise and datacenter drives that aren't sold through retailers.
There's no doubt that the newer Memblaze PBlaze5 is a much better overall value than the previous generation. It isn't better in every way, but it makes smart tradeoffs and stays in the same market segment. We'd have to test some more direct competitors with comparable endurance ratings and the same PCIe x8 interface to know whether the PBlaze5 C916 is the best high-end TLC drive currently available, but there are only a few other products out there that aspire to offer this combination of performance, capacity and endurance. The PBlaze5 C916 clearly stands above the more mainstream product segments and should be taken seriously as a competitor at the very high end of the SSD market.