Original Link: https://www.anandtech.com/show/15491/enterprise-nvme-hynix-samsung-dapustor-dera
Enterprise NVMe Round-Up 2: SK Hynix, Samsung, DapuStor and DERA
by Billy Tallis on February 14, 2020 11:15 AM EST

Last fall, several enterprise SSD vendors reached out to us around the same time, offering review samples of their latest and greatest. We put together an updated test suite for enterprise and datacenter SSDs and spent more than a month hammering the drives. Our review of two SATA drives was published first, but this review of nine NVMe drives is what we've really been looking forward to. These multi-TB drives show just how far NVMe can go beyond the limits of SATA and SAS SSDs.
Two of the products we're looking at today come from familiar manufacturers. Samsung is the dominant player in the SSD market, shipping more drives than the next three companies combined. We have their PM1725a in house: an older flagship model, but still the fastest we've ever tested with almost twice the random read performance of an Intel Optane SSD. SK Hynix sent over their PE6011, a low-power entry-level datacenter U.2 drive that is part of their strategy to reestablish a foothold in market segments where they have faltered in recent years.
We also have two new brands featured in one of our reviews for the first time. DapuStor and DERA are two Chinese drive manufacturers that have been around for a few years but have until recently been focusing on their domestic market. DERA's strategy is more centered around developing home-grown technology to compete with foreign suppliers by designing their own SSD controller. DapuStor worked with familiar names like Marvell and Kioxia/Toshiba to create datacenter SSDs focused on efficiency, while also pursuing a long-term roadmap toward advanced in-house tech.
Nine new drives adding up to 40TB of high-end storage might seem like a lot, but it's barely enough to cover the breadth of the enterprise SSD market. No two of these models are in direct competition. Enterprise SSD product segments can be defined in terms of form factor, write endurance, capacity and performance. Different use cases call for different kinds of drives, and there's no one-size-fits-all solution.
Reviewed Models Overview

| Model | Interface | Form Factor | Capacities | Memory | Write Endurance |
|-------|-----------|-------------|------------|--------|-----------------|
| DapuStor Haishen3 H3000 | PCIe 3.0 x4 | 2.5" 15mm U.2 | 1 TB, 2 TB, 4 TB, 8 TB | 96L 3D TLC | 1 DWPD |
| DapuStor Haishen3 H3100 | PCIe 3.0 x4 | 2.5" 15mm U.2 | 800 GB, 1.6 TB, 3.2 TB, 6.4 TB | 96L 3D TLC | 3 DWPD |
| DERA D5437 | PCIe 3.0 x4 | 2.5" 15mm U.2 | 2 TB, 4 TB, 8 TB | 64L 3D TLC | 1 DWPD |
| DERA D5457 | PCIe 3.0 x4 | 2.5" 15mm U.2 | 1.6 TB, 3.2 TB, 6.4 TB | 64L 3D TLC | 3 DWPD |
| SK hynix PE6011 | PCIe 3.0 x4 | 2.5" 7mm U.2 | 960 GB, 1.92 TB, 3.84 TB, 7.68 TB | 72L 3D TLC | 1 DWPD |
| Samsung PM1725a | PCIe 3.0 x8 | HHHL AIC | 1.6 TB, 3.2 TB, 6.4 TB | 48L 3D TLC | 5 DWPD |
| Previously reviewed by AnandTech: | | | | | |
| Micron 5100 MAX | SATA | 2.5" 7mm | 240 GB, 480 GB, 960 GB, 1.92 TB | 32L 3D TLC | 5 DWPD |
| Samsung 883 DCT | SATA | 2.5" 7mm | 240 GB, 480 GB, 960 GB, 1.92 TB, 3.84 TB | 64L 3D TLC | 0.8 DWPD |
| Samsung 983 DCT | PCIe 3.0 x4 | 2.5" 7mm U.2 | 960 GB, 1.92 TB | 64L 3D TLC | 0.8 DWPD |
| Intel DC P4510 | PCIe 3.0 x4 | 2.5" 15mm U.2 | 1 TB, 2 TB, 4 TB, 8 TB | 64L 3D TLC | 0.7–1.1 DWPD |
| Intel Optane DC P4800X | PCIe 3.0 x4 | HHHL AIC | 375 GB, 750 GB, 1.5 TB | 3D XPoint | 60 DWPD |
| Memblaze PBlaze5 C916 | PCIe 3.0 x8 | HHHL AIC | 3.2 TB, 6.4 TB | 64L 3D TLC | 3 DWPD |
To provide some more meaningful comparisons, we've retested and included several other enterprise SSDs from previous reviews.
Drives In Detail: Samsung & SK hynix
Samsung PM1725a
The Samsung PM1725a is a flagship high-end model that's a few generations old. Like the Micron 5100 MAX that was in our recent SATA review, we're testing this because MyDigitalDiscount has a batch of the 6.4TB model that they're selling for just 19 cents/GB - less than many high-end consumer NVMe drives.
Despite its age, the 6.4 TB Samsung PM1725a is by several metrics the fastest SSD we have ever tested, and one of only two drives in our collection that can hit more than 1 million IOPS for 4kB random reads. The PM1725a is also rated for 5 drive writes per day, more than most current-generation high-end enterprise SSDs, and even more than its successor, the PM1725b (3 DWPD), which replaced 48L 3D TLC with 64L TLC and slightly increased performance.
Samsung PM1725a SSD Specifications

| | 2.5" 15mm U.2 | PCIe HHHL AIC |
|---|---|---|
| Controller | Samsung S4LP049X01 "EPIC" | Samsung S4LP049X01 "EPIC" |
| Interface, Protocol | PCIe 3.0 x4, NVMe 1.2 | PCIe 3.0 x8, NVMe 1.2 |
| Capacities | 800 GB, 1.6 TB, 3.2 TB, 6.4 TB | 1.6 TB, 3.2 TB, 6.4 TB |
| NAND Flash | Samsung 512Gbit 48L 3D TLC | Samsung 512Gbit 48L 3D TLC |
| DRAM | Samsung 8Gbit DDR3-1866 | Samsung 8Gbit DDR3-1866 |
| Sequential Read | 3.3 GB/s | 6.2 GB/s |
| Sequential Write | 3.0 GB/s | 2.6 GB/s |
| Random Read | 800k IOPS | 1000k IOPS |
| Random Write | 160k IOPS | 180k IOPS |
| Power Draw (Max) | 23 W | 21 W |
| Power Draw (Idle) | 8 W | 7.5 W |
| Write Endurance | 5 DWPD | 5 DWPD |
| Warranty | 5 years | 5 years |
A big part of why the PM1725a is so fast is that there's simply a lot of SSD here. The controller (labeled "EPIC") is massive, with a PCIe x8 uplink and 16 channels for interfacing with the NAND. The usable capacity is 6.4 TB, but the drive has 8 TB of flash onboard, meaning that this drive has more internal spare area than the usable capacity of the smallest SSD in this review. All these chips and the 8-lane uplink require a PCIe add-in card form factor, with a large heatsink to dissipate over 20W.
The PM1725a is a little bit outdated by only supporting version 1.2 of the NVMe spec, but it implements almost all of the optional features, including support for multiple namespaces and SR-IOV virtualization so this massive drive can be shared among several virtual machines with minimal overhead.
SK hynix PE6011
SK hynix currently holds fifth place in the overall SSD market (by drives shipped), but they're one of only three companies that are fully vertically integrated: they make their own 3D NAND, DRAM, controllers, firmware and SSDs. Now that their 3D NAND seems to be catching up with the rest of the market, that vertical integration may help them significantly improve their standing, but they still have fairly low visibility in some important market segments. They have seen the most success in the client OEM SSD market and have reentered the consumer retail market, but enterprise and datacenter SSDs are where the best profit margins can usually be found.
SK hynix PE6000 Series NVMe SSD Specifications

| | PE6011 960 GB | PE6011 1.92 TB | PE6011 3.84 TB | PE6011 7.68 TB | PE6031 (800 GB – 6.4 TB) |
|---|---|---|---|---|---|
| Controller | SK hynix SH58800GG (all models) | | | | |
| Form Factor | 2.5" 7mm U.2 (all models) | | | | |
| Interface, Protocol | PCIe 3.0 x4, NVMe 1.3a (all models) | | | | |
| NAND Flash | SK hynix 512Gbit 72L 3D-V4 TLC (all models) | | | | |
| DRAM | SK hynix DDR4 (all models) | | | | |
| Sequential Read | 3.2 GB/s | 3.2 GB/s | 3.2 GB/s | 3.2 GB/s | 3.2 GB/s |
| Sequential Write | 650 MB/s | 1250 MB/s | 2.3 GB/s | 2.45 GB/s | 2.45 GB/s |
| Random Read IOPS | 220k | 410k | 620k | 610k | 620k |
| Random Write IOPS | 27k | 50k | 67k | 70k | 160k |
| Power Draw (Read) | 8.0 W | 8.0 W | 8.5 W | 10.0 W | 10 W |
| Power Draw (Write) | 6.0 W | 8.0 W | 12.0 W | 14.0 W | 14 W |
| Power Draw (Idle) | 3.5 W | 3.5 W | 3.5 W | 3.5 W | 3.7 W |
| Write Endurance | 1 DWPD | 1 DWPD | 1 DWPD | 1 DWPD | 3 DWPD |
| Warranty | 5 years (all models) | | | | |
The SK hynix PE6011 and its sibling PE6031 are low-power datacenter NVMe SSDs, using the same 2.5"/7mm form factor as consumer SATA drives, but with a U.2 connector to provide a PCIe 3.0 x4 interface. Most enterprise and datacenter U.2 drives instead use a 15mm thick case to allow for stacked PCBs or more cooling and higher power levels. Using their 72-layer 3D TLC NAND, SK hynix can still pack up to 8TB of storage into this case (7.68 TB usable) along with an 8-channel controller of their own design plus the necessary power loss protection capacitors. The lower power limit of the thinner 2.5" form factor does mean the PE6011 is a bit more limited in performance than most of the drives in this review, but there are lots of other SSDs out there for this product segment that we haven't had the chance to test.
The PE6031 (not tested) is pretty much the same hardware as the PE6011, but with a higher overprovisioning ratio: more spare area, less usable capacity. That allows the PE6031 to target more write-heavy workloads with twice the random write performance and an endurance rating of 3 DWPD instead of 1 DWPD. The PE6011 targets the larger market of SSDs for read-intensive workloads.
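To put rough numbers on that trade-off, here's a quick back-of-the-envelope sketch. It assumes both models are built from the same 8 TiB of raw NAND, which is our assumption for illustration rather than something SK hynix states on the spec sheet:

```c
// Rough overprovisioning estimate: spare capacity as a fraction of usable
// capacity, assuming 8 TiB of raw NAND behind both the PE6011 and PE6031.
#include <stdio.h>

int main(void)
{
    const double raw_tb = 8.0 * 1.099511627776;   // assumed 8 TiB of raw flash, in decimal TB
    const double usable_tb[] = { 7.68, 6.40 };    // PE6011 vs PE6031 top capacities
    const char *names[] = { "PE6011", "PE6031" };

    for (int i = 0; i < 2; i++) {
        double op_pct = (raw_tb - usable_tb[i]) / usable_tb[i] * 100.0;
        printf("%s: %.2f TB usable of ~%.2f TB raw -> ~%.0f%% overprovisioning\n",
               names[i], usable_tb[i], raw_tb, op_pct);
    }
    return 0;
}
```

Under that assumption the PE6031 reserves roughly two and a half times as much spare area relative to its usable capacity, which is where the extra random write performance and endurance headroom come from.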
DapuStor Haishen3
Shenzhen DAPU Microelectronics Co. (DapuStor for short) has a fairly ordinary business model for a smaller player in the enterprise SSD market: partner with a controller supplier and a NAND supplier, and write some custom firmware. In the case of the DapuStor Haishen3 series, the controller comes from Marvell and the NAND is from Kioxia (formerly Toshiba Memory).
The DapuStor Haishen3 drives use Kioxia/Toshiba BiCS4 96-layer 3D eTLC, the most advanced NAND flash included in this review. We're testing both the H3000 and H3100 versions: 1 DWPD for read-intensive workloads, or 3 DWPD for more write-heavy mixed IO. DapuStor touts the power efficiency of the Haishen3 drives as one of their key selling points. As for advanced features, the drives include support for multiple NVMe namespaces, and dual-port PCIe support for high availability when connected through appropriate backplanes. DapuStor is also willing to supply custom firmware enabling other advanced features like SR-IOV virtualization, streams/IO determinism, and Key-Value or Zoned namespace support. However, it's not clear how ready those options are—the Haishen3 is a very new product, and our samples were slightly delayed so that we could get the latest firmware.
DapuStor Haishen3 Series Specifications

| | H3000 | H3100 |
|---|---|---|
| Controller | Marvell 88SS1098 "Zao" | Marvell 88SS1098 "Zao" |
| Form Factor | 2.5" 15mm U.2 or HHHL AIC (top capacity AIC only) | 2.5" 15mm U.2 or HHHL AIC (top capacity AIC only) |
| Interface, Protocol | PCIe 3.0 x4, NVMe 1.3 | PCIe 3.0 x4, NVMe 1.3 |
| Capacities | 1 TB, 2 TB, 4 TB, 8 TB, 16 TB | 800 GB, 1.6 TB, 3.2 TB, 6.4 TB, 12.8 TB |
| NAND Flash | Kioxia (Toshiba) 512Gbit BiCS4 96L 3D TLC | Kioxia (Toshiba) 512Gbit BiCS4 96L 3D TLC |
| DRAM | Nanya 8Gbit DDR4-2666 | Nanya 8Gbit DDR4-2666 |
| Sequential Read | 3.5 GB/s | 3.5 GB/s |
| Sequential Write | 1.3 / 2.4 / 3.3 / 3.4 GB/s (by capacity) | 1.3 / 2.4 / 3.3 / 3.4 GB/s (by capacity) |
| Random Read IOPS | 450k–710k (by capacity) | 450k–710k (by capacity) |
| Random Write IOPS | 38k / 60k / 108k / 99k / 140k (1–16 TB) | 98k / 170k / 335k / 300k / 465k (0.8–12.8 TB) |
| Power Draw (Typical) | 6.5–8.5 W | 6.5–8.5 W |
| Power Draw (Max) | 9.5 / 13 / 15 W (by capacity) | 9.5 / 13 / 15 W (by capacity) |
| Write Endurance | 1 DWPD | 3 DWPD |
| Warranty | 5 years | 5 years |
The Marvell 88SS1098 "Zao" controller is their 8-channel solution for enterprise/datacenter SSDs, bringing lots of features that were missing from the 1093/1092 "Eldora" series controllers and left out of the newer 1100 series client SSD controllers. The 1098 isn't quite Marvell's top of the line SSD controller, but their 88SS1088 16-channel controller is a two-chip solution that's pretty much two 1098s connected by a 4GB/s link.
Our Haishen3 review samples use the 15mm U.2 form factor, but have only a single PCB inside instead of two stacked. That leaves room for the case to serve as a fairly large heatsink, with extrusions that make close contact with the controller and NAND, and cutouts for the two large power loss protection capacitors.
DapuStor is also working on a Haishen3-XL series that uses Kioxia's low-latency 3D SLC NAND to compete against Samsung Z-SSDs, and that should be available very soon.
DERA D5007 Series
DERA is another Chinese brand we were only recently introduced to. They announced their second generation of enterprise SSDs at the end of 2018 and are now trying to enter the North American market after securing design wins with numerous major Chinese cloud and telecom providers.
DERA is part of China's efforts to develop competitive home-grown technology. DERA has their own in-house TAI NVMe SSD controller that is used in the D5437 and D5457 SSDs we are testing today. Their NAND supplier is UNIC, which has been buying Intel 64L wafers since 2018 and doing their own testing, binning and packaging. UNIC eventually plans to be a distributor for Yangtze Memory's 3D NAND, but even though YMTC announced mass production of their 64L NAND back in November, they have not had any significant impact on the NAND market yet. (And probably won't any time soon, given that YMTC is based in Wuhan.) For DRAM, DERA is using Micron DDR3.
DERA D5007 Series Specifications

| | D5437 2 TB | D5437 4 TB | D5437 8 TB | D5457 1.6 TB | D5457 3.2 TB | D5457 6.4 TB |
|---|---|---|---|---|---|---|
| Controller | DERA TAI-EN316801-P02 (all models) | | | | | |
| Form Factor | 2.5" 15mm U.2 or PCIe HHHL AIC (all models) | | | | | |
| Interface, Protocol | PCIe 3.0 x4, NVMe 1.3 (all models) | | | | | |
| NAND Flash | UNIC2 (Intel) 512Gbit 64L 3D TLC (all models) | | | | | |
| DRAM | Micron DDR3-1866 (all models) | | | | | |
| Sequential Read | 3.3 GB/s | 3.3 GB/s | 3.3 GB/s | 3.3 GB/s | 3.3 GB/s | 3.3 GB/s |
| Sequential Write | 1.8 GB/s | 2.9 GB/s | 3.2 GB/s | 1.8 GB/s | 2.9 GB/s | 3.2 GB/s |
| Random Read IOPS | 820k | 830k | 830k | 820k | 830k | 830k |
| Random Write IOPS | 110k | 160k | 145k | 240k | 360k | 375k |
| Power Draw (Random Read) | 13.0 W | 14.0 W | 14.5 W | 13.5 W | 13.5 W | 14.5 W |
| Power Draw (Random Write) | 15.0 W | 18.0 W | 18.4 W | 15.0 W | 18.5 W | 20.0 W |
| Power Draw (Idle) | 7.0 W (all models) | | | | | |
| Write Endurance | 1 DWPD | 1 DWPD | 1 DWPD | 3 DWPD | 3 DWPD | 3 DWPD |
| Warranty | 5 years (all models) | | | | | |
As is typical, DERA's SSDs come in two versions that share the same basic hardware but use different overprovisioning ratios to optimize for high capacity and read-oriented workloads (D5437), or higher endurance and better write performance (D5457). These drives have most of the hallmarks of a top-of-the-line enterprise SSD. The controller is a large 16-channel design, with a die size around 114 mm². Internally, the U.2 versions use a folded PCB to provide plenty of space for NAND and DRAM, and the power loss protection capacitors are mounted on a daughterboard. The case design does allow for a little bit of airflow between the two PCB layers, but the power-hungry components still need to face outward for contact with the drive case. The NAND packages on the inward-facing PCB surfaces use fewer dies per package, and when a particular capacity doesn't need to populate all of the pads for NAND, those interior spots are the ones left empty.
The DERA SSDs are limited to the same PCIe 3.0 x4 interface as the rest of our U.2 drives, but their performance specifications are otherwise mostly higher than those of the other drives in this review—as are their power consumption ratings.
Test Setup
For this year's enterprise SSD reviews, we've overhauled our test suite. The overall structure of our tests is the same, but a lot has changed under the hood. We're using newer versions of our benchmarking tools and the latest long-term support kernel branch. The tests have been reconfigured to drastically reduce CPU overhead, which has minimal impact on SATA drives but lets us properly push the limits of the many enterprise NVMe drives for the first time.
The general philosophy underlying the test configuration was to keep everything at its default or most reasonable everyday settings, and change as little as possible while still allowing us to measure the full performance of the SSDs. Esoteric kernel and driver options that could marginally improve performance were ignored. The biggest change from last year's configuration and away from normal everyday usage is in the IO APIs used by the fio benchmarking tool to interact with the operating system.
In the past, we configured fio to use ordinary synchronous IO APIs: read() and write() style system calls. The way these work is that the application makes a system call to perform a read or write operation, and control transfers to the kernel to handle the IO. The application thread is suspended until that IO is complete. This means we can only have one outstanding IO request per thread, and hitting a drive with a queue depth of 32 requires 32 threads. That's no problem on a 36-core test system, but when it takes a queue depth of 200 or more to saturate a high-end NVMe SSD, we run out of CPU power. Running more threads than cores can get us a bit more throughput than just QD36, but that causes latency to suffer not just from the overhead of each system call, but from threads fighting over the limited number of CPU cores. In practice, this testbed is limited to about 560k IOPS when performing IO this way, and that leaves no CPU time for doing anything useful with the data that's moving around. Spectre, Meltdown and other vulnerability mitigations tend to keep increasing system call and context switch overhead, so this situation isn't getting any better.
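As a point of reference, a synchronous worker boils down to something like the sketch below (a simplified illustration, not our actual test code): each thread can only ever have one request outstanding, because it sleeps inside the pread() call until the data comes back.

```c
// Synchronous (blocking) 4kB random reads: one outstanding IO per thread.
// Reaching QD32 this way requires running 32 copies of this loop in parallel.
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

void sync_random_reads(int fd, long num_ios, long device_blocks)
{
    const size_t block_size = 4096;
    void *buf;
    posix_memalign(&buf, block_size, block_size);   // O_DIRECT needs aligned buffers

    for (long i = 0; i < num_ios; i++) {
        off_t offset = (random() % device_blocks) * (off_t)block_size;
        // Control transfers to the kernel here and the thread is suspended
        // until the SSD returns the data, so the drive never sees more than
        // one request at a time from this thread.
        pread(fd, buf, block_size, offset);
    }
    free(buf);
}
```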
The alternative is to use asynchronous storage APIs that allow an application thread to submit an IO request to the operating system but then continue executing while the IO is performed. For benchmarking purposes, that continued execution means the application can keep submitting more IO requests before the first one is complete, and a single thread can load down an SSD with a reasonably high queue depth.
Asynchronous IO presents challenges, especially on Linux. On any platform, asynchronous IO is a bit more complicated for the application programmer to deal with, because submitting a request and getting the result become separate steps, and operations may complete out of order. On Linux specifically, the original async IO APIs were fraught with limitations. The most significant is that Linux native AIO is only actually asynchronous when IO is set to bypass the operating system's caches, which is the opposite of what most real-world software should want. (Our benchmarking tools have to bypass the caches to ensure we're measuring the SSD and not the testbed's 192GB of RAM.) Other AIO limitations include support for only one filesystem, and myriad scenarios in which IO silently falls back to being synchronous, unexpectedly halting the application thread that submitted the request. The end result of all those issues is that true asynchronous IO on Linux is quite rare and only usable by some applications with dedicated programmers and competent sysadmins. Benchmarking with Linux AIO makes it possible to stress even the fastest SSD, but such a benchmark can never be representative of how mainstream software does IO.
The best way to set storage benchmark records is to get the operating system kernel out of the way entirely using a userspace IO framework like SPDK. This eliminates virtually all system call overhead and makes truly asynchronous IO possible and fast. It also eliminates the filesystem and the operating system's caching infrastructure, making those the application's responsibility. Sharing an SSD between applications becomes almost impossible, and at the very least requires rewriting both applications to use SPDK and explicitly cooperate in how they use the drive. SPDK works well for use cases where a heavily customized application stack and system configuration is possible, but it is no more capable of becoming a mainstream solution than Linux AIO.
A New Hope
What's changed recently is that Linux kernel developer (and fio author) Jens Axboe introduced a new asynchronous IO API that's easy to use and very fast. Axboe has documented the rationale behind the new API and how to use it. In summary: The core principle is that communication between the kernel and userspace software takes place with a pair of ring buffers, so the API is called io_uring. One ring buffer is the IO submission queue: the application writes requests into this buffer, and the kernel reads them to act on. The other is the completion queue, where the kernel writes notification of completed IOs, which the application is watching for. This dual queue structure is basically the same as how the operating system communicates with NVMe devices. For io_uring, both queues are mapped into the memory address spaces of both the application and the kernel, so there's no copying of data required. The application doesn't need to make any system calls to check for completed IO; it just needs to inspect the contents of the completion ring. Submitting IO requests involves putting the request in the submission queue, then making a system call to notify the kernel that the queue isn't empty. There's an option to tell the kernel to keep checking the submission queue as long as it doesn't stay idle for long. When that mode is used, a large number of IOs can be handled with an average of approximately zero system calls per IO. Even without it, io_uring allows for IO to be done with one system call per IO compared to two per IO with the old Linux AIO API.
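To make that a little more concrete, here is a minimal single-read sketch using the liburing helper library. The device path is a placeholder and error handling is omitted; a real benchmark keeps many submission queue entries in flight instead of waiting on each completion individually.

```c
// Minimal io_uring read via liburing: prepare an SQE, submit it, reap the CQE.
// Build with: gcc uring_read.c -luring
#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>

int main(void)
{
    struct io_uring ring;
    io_uring_queue_init(64, &ring, 0);                   // 64-entry submission/completion rings

    int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);  // placeholder block device

    void *buf;
    posix_memalign(&buf, 4096, 4096);                    // O_DIRECT requires aligned buffers
    struct iovec iov = { .iov_base = buf, .iov_len = 4096 };

    // Write a request into the submission ring, then make one system call to
    // tell the kernel the queue is no longer empty.
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_readv(sqe, fd, &iov, 1, 0);            // 4kB read at offset 0
    io_uring_submit(&ring);

    // Completions show up in the completion ring; wait for ours and mark it seen.
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    printf("read returned %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    return 0;
}
```

A benchmarking tool like fio simply scales this pattern up: one thread keeps dozens or hundreds of submission entries in flight, topping up the ring as completions are reaped.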
Using synchronous IO, our enterprise SSD testbed cannot reach 600k IOPS. With io_uring, we can do more than 400k IOPS on a single CPU core without any extra performance tuning effort. Hitting 1M IOPS on a real SSD takes at most 4 CPU cores, so even the Micron X100 and upcoming Intel Alder Stream 3D XPoint SSDs should pose no challenge to our new benchmarks.
The first stable kernel to include the io_uring API was version 5.1 released in May 2019. The first long term support (LTS) branch with io_uring is 5.4, released in November 2019 and used in this review. The io_uring API is still very new and not used by much real-world software. But unlike the situation with the old Linux AIO APIs or SPDK, this seems likely to change. It can do more than previous asynchronous IO solutions, including being used for both high-performance storage and network IO. New features are arriving with every new kernel release; lots of developers are trying it out, and I've seen feature requests fulfilled in a matter of days. Many high-level languages and frameworks that currently simulate asynchronous IO using thread pools will be able to implement new io_uring backends.
For storage benchmarking on Linux, io_uring currently strikes the best balance between the competing desires to simulate workloads in a realistic manner, and to accurately gauge what kind of performance a solid state drive is capable of providing. All of the fio-based tests in our enterprise SSD test suite now use io_uring and never run more than 16 threads even when testing queue depths up to 512. With the CPU bottlenecks eliminated, we have also disabled HyperThreading.
| Enterprise SSD Test System | |
|---|---|
| System Model | Intel Server R2208WFTZS |
| CPU | 2x Intel Xeon Gold 6154 (18 cores, 3.0 GHz) |
| Motherboard | Intel S2600WFT (firmware 2.01.0009, CPU microcode 0x2000065) |
| Chipset | Intel C624 |
| Memory | 192 GB total, Micron DDR4-2666 16 GB modules |
| Software | Linux kernel 5.4.0, fio 3.16 |

Thanks to StarTech for providing a RK2236BKF 22U rack cabinet.
QD1 Random Read Performance
Drive throughput with a queue depth of one is usually not advertised, but almost every latency or consistency metric reported on a spec sheet is measured at QD1 and usually for 4kB transfers. When the drive only has one command to work on at a time, there's nothing to get in the way of it offering its best-case access latency. Performance at such light loads is absolutely not what most of these drives are made for, but they have to make it through the easy tests before we move on to the more realistic challenges.
Random read performance at QD1 is mostly determined by the inherent latency of the underlying storage medium. Since most of these SSDs are using 64+ layer 3D TLC, they're all on a fairly even footing. The SK hynix PE6011 is the slowest of the new NVMe drives while the Dapu Haishen3 and Samsung PM1725a are among the fastest.
[Graphs: Power Efficiency in kIOPS/W | Average Power in W]
The drives with big 16-channel controllers have the worst power efficiency at low load, because they idle in the 6-7W range. The DERA SSDs, Samsung PM1725 and Memblaze PBlaze5 are all in the same ballpark. The SK hynix PE6011 is a clear step up, and the Dapu Haishen3s are the most efficient of the new drives on this test. The SATA drives and Samsung's low-power 983 DCT still have higher efficiency ratings because they're under 2W during this test.
Flipping the numbers around to look at latency instead of IOPS, we see that the DERA drives seem to have surprisingly high tail latencies comparable to the old Micron SATA drive and far in excess of any of the other NVMe drives. The rest of the new NVMe drives in our collection have great QoS out to four 9s.
There aren't any surprises when looking at random read performance across a range of block sizes. All of the new NVMe drives have constant IOPS for block sizes of 4kB and smaller, and for larger block sizes IOPS decreases but throughput grows significantly.
QD1 Random Write Performance
Random write performance at QD1 is mostly a matter of delivering the data into the SSD's cache and getting back an acknowledgement; the performance of the NAND flash itself doesn't factor in much until the drive is busier with a higher queue depth. The exception here is the Optane SSD, which doesn't have or need a DRAM cache. Between the fastest and slowest flash-based NVMe SSDs here we're only looking at about a 30% difference. The SK hynix PE6011 and Samsung PM1725a are a bit on the slow side, while the DERA SSDs are among the fastest.
[Graphs: Power Efficiency in kIOPS/W | Average Power in W]
Power draw during this test is generally higher than for the QD1 random read test, but the pattern of bigger SSD controllers being less efficient still mostly holds true. The Dapu Haishen3 and SK hynix PE6011 are the most efficient of our new NVMe drives, and are also helped some by their lower capacity: the 2TB models don't have to spend as much power on keeping DRAM and NAND chips awake.
All the drives start to show elevated tail latency when we go out to four 9s, but the SK hynix PE6011 and Samsung PM1725a also have issues at the 99th percentile level (as does the Intel P4510). The Dapu Haishen3 drives have the best QoS scores on this test, even though their average latency is a few microseconds behind the fastest flash-based SSDs in this batch.
Looking at random write performance for different block sizes reveals major differences between drives. Everything has obviously been optimized to offer peak IOPS with 4kB block size (except for the Optane SSD). However, several drives do so at the expense of vastly lower performance on sub-4kB block sizes. The DapuStor Haishen3 and DERA SSDs join the Memblaze PBlaze5 on the list of drives that maybe shouldn't even offer the option of operating with 512-byte sector sizes. For those drives, IOPS falls by a factor of 4-5x and they seem to be bottlenecked by doing a read-modify-write cycle in order to support small block writes.
QD1 Sequential Read Performance
When performing sequential reads of 128kB blocks, QD1 isn't enough for any of these drives to really stretch their legs. Unlike consumer SSDs, most of these drives seem to be doing little or no readahead caching, which is probably a reasonable decision for heavily multi-user environments where IO is less predictable. It does lead to lackluster performance numbers, with none of our new drives breaking 1GB/s. The DERA SSDs are fastest of the new bunch, but are only half as fast on this test as the Intel P4510 or Samsung 983 DCT.
[Graphs: Power Efficiency in MB/s/W | Average Power in W]
Even though we're starting to get up to non-trivial throughput with this test, the power efficiency scores are still dominated by the baseline idle power draw of these SSDs. The 16-channel drives are mostly in the 8-9W range (DERA, Samsung PM1725a) while the 8-channel drives are around half that. The DapuStor Haishen3 drives are the most efficient of our new drives, but are still clearly a ways behind the Intel P4510 and Samsung 983 DCT that are much faster on this test.
All of the new NVMe drives in our collection are still showing a lot of performance growth by the time the block size test reaches 1MB reads. At that point, they've all at least caught up with the handful of other drives that performed very well on the QD1/128kB sequential read test, but it's clear that they need either a higher queue depth or even larger block sizes in order to make the most of their theoretical throughput.
QD1 Sequential Write Performance
A few different effects are at play during our QD1 sequential write test. The drives were preconditioned with a few full drive writes before the test, so they're at or near steady-state when this test begins. This leads to the general pattern of larger drives or drives with more overprovisioning performing better, because they can more easily free up a block to accept new writes. However, at QD1 the drives are getting a bit of idle time when waiting on the host system to deliver the next write command, and that results in poor link utilization and fairly low top speeds. It also compresses the spread of scores slightly compared to what the spec sheets indicate we'll see at high queue depths.
The DapuStor Haishen3 drives stand out as the best performers in the 2TB class; they break the pattern of better performance from bigger drives and are performing on par with the 8TB class drives with comparable overprovisioning ratios.
[Graphs: Power Efficiency in MB/s/W | Average Power in W]
The 1.6TB DapuStor Haishen3 H3100 stands out as the most efficient flash-based NVMe SSD on this test, by a fairly wide margin. Its QD1 sequential write performance is similar to the 8TB drives with 16-channel controllers, but the Haishen3 H3100 is also tied for lowest power consumption among the NVMe drives: just under 7W compared to a maximum of over 18W for the 8TB DERA D5437. The Haishen3 H3000's efficiency score is more in line with the rest of the competition, because its lower overprovisioning ratio forces it to spend quite a bit more power on background flash management even at this low queue depth.
In contrast to our random write block size test, for sequential writes extremely poor small-block write performance seems to be the norm rather than the exception; most of these drives don't take kindly to sub-4kB writes. Increasing block sizes past 128kB up to at least 1MB doesn't help the sequential write performance of these drives; in order to hit the speeds advertised on the spec sheets, we need to go beyond QD1.
Peak Throughput
For client/consumer SSDs we primarily focus on low queue depth performance for its relevance to interactive workloads. Server workloads are often intense enough to keep a pile of drives busy, so the maximum attainable throughput of enterprise SSDs is actually important. But it usually isn't a good idea to focus solely on throughput while ignoring latency, because somewhere down the line there's always an end user waiting for the server to respond.
In order to characterize the maximum throughput an SSD can reach, we need to test at a range of queue depths. Different drives will reach their full speed at different queue depths, and increasing the queue depth beyond that saturation point may be slightly detrimental to throughput, and will drastically and unnecessarily increase latency. Because of that, we are not going to compare drives at a single fixed queue depth. Instead, each drive was tested at a range of queue depths up to the excessively high QD 512. For each drive, the queue depth with the highest performance was identified. Rather than report that value, we're reporting the throughput, latency, and power efficiency for the lowest queue depth that provides at least 95% of the highest obtainable performance. This often yields much more reasonable latency numbers, and is representative of how a reasonable operating system's IO scheduler should behave. (Our tests have to be run with any such scheduler disabled, or we would not get the queue depths we ask for.)
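That selection step amounts to a simple search over the measured results. The sketch below illustrates it with made-up numbers; the 95% threshold is the only value taken from our methodology.

```c
// Given throughput measured at each tested queue depth, report the lowest
// queue depth that achieves at least 95% of the best observed throughput.
#include <stdio.h>

int pick_reported_qd(const int qds[], const double kiops[], int n)
{
    double best = 0;
    for (int i = 0; i < n; i++)
        if (kiops[i] > best)
            best = kiops[i];

    for (int i = 0; i < n; i++)             // qds[] is assumed sorted ascending
        if (kiops[i] >= 0.95 * best)
            return qds[i];
    return qds[n - 1];
}

int main(void)
{
    // Hypothetical sweep: throughput saturates well before QD512.
    int qds[]      = { 1, 2, 4, 8, 16, 32, 64, 128, 256, 512 };
    double kiops[] = { 30, 58, 110, 205, 370, 560, 690, 735, 740, 738 };
    printf("report QD%d\n", pick_reported_qd(qds, kiops, 10));   // -> QD128
    return 0;
}
```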
Unlike last year's enterprise SSD reviews, we're now using the new io_uring asynchronous IO API on Linux instead of the simpler synchronous APIs that limit software to one outstanding IO per thread. This means we can hit high queue depths without loading down the system with more threads than we have physical CPU cores, and that leads to much better latency metrics—but the impact on SATA drives is minimal because they are limited to QD32. Our new test suite uses up to 16 threads to issue IO.
Peak Random Read Performance
Our new test suite with the CPU bottleneck removed is very helpful to the peak random read performance scores of most of these drives. The two SSDs with a PCIe x8 interface stand out. Both can hit over 1M IOPS with a sufficiently high queue depth, though the scores shown here are for somewhat lower queue depths where latency is more reasonable. We're still looking at very high queue depths to get within a few percent of 1M IOPS: QD192 for the Samsung PM1725a and QD384 for the Memblaze PBlaze5 C916.
The U.2 drives are all limited to PCIe 3.0 x4 speeds, and the best random read performance we see out of them comes from the DapuStor Haishen3 H3000 at 751k IOPS, but that's closely followed by the other Dapu drive and all four of the DERA SSDs. The SK hynix PE6011 is the slowest NVMe model here, with its 8TB version coming up just short of 600k IOPS. The Intel Optane SSD's standing is actually harmed significantly by this year's test suite upgrade, because even under last year's suite the drive was as much of a bottleneck as the CPU. Reducing the CPU overhead has allowed many of the flash-based SSDs to pull ahead of the Optane SSD for random read throughput.
[Graphs: Power Efficiency in kIOPS/W | Average Power in W]
Now that we're letting the drives run at high queue depths, the big 16-channel controllers aren't automatically at a disadvantage for power efficiency. Those drives are still drawing much more power (13-14W for the DERA and Memblaze, almost 20W for the Samsung PM1725a), but they can deliver a lot of performance as a result. The drives with 8-channel controllers are mostly operating around 7W, though the 7.68TB SK Hynix PE6011 pushes that up to 10W.
Putting that all in terms of performance per Watt, the DapuStor Haishen3 drives score another clear win on efficiency. Second and third place are taken by the Samsung 983 DCT and Memblaze PBlaze5 C916, two drives at opposite ends of the power consumption spectrum. After that the scores are fairly tightly clustered, with smaller capacity models generally delivering better performance per Watt, because even the 2TB class drives get pretty close to saturating the PCIe 3.0 x4 link and they don't need as much power as their 8TB siblings.
For latency scores, we're no longer going to look at just the mean and tail latencies at whatever queue depth gives peak throughput. Instead, we've run a separate test that submits IO requests at fixed rates, rather than at fixed queue depths. This is a more realistic way of looking at latency under load, because in the real world user requests don't stop arriving just because your backlog hits 32 or 256 IOs. This test starts at a mere 5k IOPS and steps up at 5k increments up to 100k IOPS, and then at 10k increments the rest of the way up to the throughput limit of these drives. That's a lot of data points per drive, so each IO rate is only tested for 64GB of random reads and that leads to the tail latency scores being a bit noisy.
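A hedged sketch of how that ladder of fixed rates can be generated (the drive's upper limit here is hypothetical, and the fio options shown are only to indicate where the rate would be plugged in):

```c
// Enumerate the fixed IO rates for the latency-vs-load sweep:
// 5k to 100k IOPS in 5k steps, then 10k steps up to the drive's limit.
#include <stdio.h>

int main(void)
{
    const int limit_kiops = 750;                     // hypothetical drive throughput limit
    for (int rate = 5; rate <= limit_kiops; rate += (rate < 100) ? 5 : 10)
        printf("fio --rw=randread --bs=4k --rate_iops=%d000 ...\n", rate);
    return 0;
}
```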
[Graphs: Mean | Median | 99th Percentile | 99.9th Percentile | 99.99th Percentile latency]
For most of their performance range, these drives stick close to the 20-30µs mean latency we measured at QD1 (which corresponds to around 30k IOPS). The Memblaze PBlaze5 C916 is the only flash-based SSD that maintains great QoS past 100k IOPS. The other drives that make it that far (the Samsung PM1725a and the larger DERA SSDs) start to show 99th percentile latencies over 100µs. The DapuStor Haishen3 H3100 1.6TB showed great throughput when testing at fixed queue depths, but during this test of fixed IO rates it failed out early from an excessive IO backlog, and the H3000 has the worst 99th percentile scores out of all of the NVMe drives.
Steady-State Sequential Write Performance
As with our sequential read test, we test sequential writes with multiple threads each performing sequential writes to different areas of the drive. This is more challenging for the drive to handle, but better represents server workloads with multiple active processes and users.
As with random writes, the biggest drives with the most overprovisioning tend to also do best on the sequential write test. However, the Intel and Hynix 8TB drives with more modest OP ratios also perform quite well, a feat that the 8TB DERA D5437 fails to match. The DapuStor Haishen3 drives perform a bit better than other small drives: the 2TB H3000 is faster than its competitors from Samsung, Hynix and DERA, and extra OP helps the 1.6TB H3100 perform almost 50% better. However, even the H3100's performance is well below spec; most of these drives are pretty severely affected by this test's multithreaded nature.
[Graphs: Power Efficiency in MB/s/W | Average Power in W]
For the most part, the fast drives are also the ones with the good power efficiency scores on this test. The 8TB Intel and 6.4TB Memblaze have the two best scores. The SATA drives are also quite competitive on efficiency since they use half the power of even the low-power NVMe drives in this bunch. The low-power 2TB class drives from Hynix, Samsung and DapuStor all have similar efficiency scores, and the DERA D5437 drives that are slow in spite of their 16-channel controller turn in the worst efficiency scores.
Mixed Random Performance
Real-world storage workloads usually aren't pure reads or writes but a mix of both. It is completely impractical to test and graph the full range of possible mixed I/O workloads—varying the proportion of reads vs writes, sequential vs random and differing block sizes leads to far too many configurations. Instead, we're going to focus on just a few scenarios that are most commonly referred to by vendors, when they provide a mixed I/O performance specification at all. We tested a range of 4kB random read/write mixes at queue depth 32 (the maximum supported by SATA SSDs) and at QD 128 to better stress the NVMe SSDs. This gives us a good picture of the maximum throughput these drives can sustain for mixed random I/O, but in many cases the queue depth will be far higher than necessary, so we can't draw meaningful conclusions about latency from this test. This test uses 8 threads when testing at QD32, and 16 threads when testing at QD128. This spreads the work over many CPU cores, and for NVMe drives it also spreads the I/O across the drive's several queues.
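For illustration, the sweep can be thought of as the nested loop below (a sketch with our own step size; fio's rwmixread parameter is where the read percentage would go):

```c
// Sweep 4kB random IO mixes from 100% reads to 100% writes, at the two
// thread/queue-depth combinations used in this test.
#include <stdio.h>

int main(void)
{
    const int threads[]  = { 8, 16 };                // QD32 run, QD128 run
    const int total_qd[] = { 32, 128 };

    for (int t = 0; t < 2; t++)
        for (int read_pct = 100; read_pct >= 0; read_pct -= 10)
            printf("%3d%% read / %3d%% write: %2d threads x QD%d (rwmixread=%d)\n",
                   read_pct, 100 - read_pct, threads[t],
                   total_qd[t] / threads[t], read_pct);
    return 0;
}
```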
The full range of read/write mixes is graphed below, but we'll primarily focus on the 70% read, 30% write case that is commonly quoted for mixed IO performance specs.
[Graphs: Queue Depth 32 | Queue Depth 128]
A queue depth of 32 is only enough to saturate the slowest of these NVMe drives on a 70/30 mixed random workload; the high-end drives aren't being stressed enough. At QD128 we see a much wider spread of scores. The DERA and Memblaze 6.4TB drives have pulled past the Optane SSD for overall throughput, but the Samsung PM1725a can't come close to keeping up with them—its throughput is more on par with the DERA D5437 drives with relatively low overprovisioning. The high OP ratio on the DapuStor Haishen3 H3100 allows it to perform much better than any of the other drives with 8-channel controllers, and better than the Intel P4510 with its 12-channel controller.
[Graphs: QD32 Power Efficiency in MB/s/W | QD32 Average Power in W]
[Graphs: QD128 Power Efficiency in MB/s/W | QD128 Average Power in W]
The DapuStor Haishen3 H3100 is the main standout on the power efficiency charts: at QD32 it's the only flash-based NVMe SSD that's more efficient than both of the SATA SSDs, and at QD128 it's getting close to the Optane SSD's efficiency score. Also at QD128 the two fastest 6.4TB drives have pretty good efficiency scores, but still quite a ways behind the Optane SSD: 15-18W vs 10W for similar performance.
[Graphs: QD32 | QD128]
Most of these drives have hit their power limit by the time the mix is up to about 30% writes. After that point, their performance steadily declines as the workload (and thus the power budget) shifts further toward slower, more power-hungry write operations. This is especially true at the higher queue depth. At QD32 things look quite different for the DERA D5457 and Memblaze PBlaze5 C916, because QD32 isn't enough to get close to their full read throughput and they're actually able to deliver higher throughput for writes than for reads. That's not quite true of the Samsung PM1725a because its steady-state random write speed is so much slower, but it does see a bit of an increase in throughput toward the end of the QD32 test run as it gets close to pure writes.
Aerospike Certification Tool
Aerospike is a high-performance NoSQL database designed for use with solid state storage. The developers of Aerospike provide the Aerospike Certification Tool (ACT), a benchmark that emulates the typical storage workload generated by the Aerospike database. This workload consists of a mix of large-block 128kB reads and writes, and small 1.5kB reads. When the ACT was initially released back in the early days of SATA SSDs, the baseline workload was defined to consist of 2000 reads per second and 1000 writes per second. A drive is considered to pass the test if it meets the following latency criteria:
- fewer than 5% of transactions exceed 1ms
- fewer than 1% of transactions exceed 8ms
- fewer than 0.1% of transactions exceed 64ms
Drives can be scored based on the highest throughput they can sustain while satisfying the latency QoS requirements. Scores are normalized relative to the baseline 1x workload, so a score of 50 indicates 100,000 reads per second and 50,000 writes per second. Since this test uses fixed IO rates, the queue depths experienced by each drive will depend on their latency, and can fluctuate during the test run if the drive slows down temporarily for a garbage collection cycle. The test will give up early if it detects the queue depths growing excessively, or if the large block IO threads can't keep up with the random reads.
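To make the scoring and pass criteria concrete, here's a small sketch (the latency percentages for the example drive are invented):

```c
// ACT scoring: a score of N corresponds to N x 2000 reads/s and N x 1000
// writes/s, and a run passes only if it stays under all three latency limits.
#include <stdbool.h>
#include <stdio.h>

struct act_latency {
    double pct_over_1ms;     // % of transactions slower than 1 ms
    double pct_over_8ms;     // % of transactions slower than 8 ms
    double pct_over_64ms;    // % of transactions slower than 64 ms
};

static bool act_passes(struct act_latency r)
{
    return r.pct_over_1ms < 5.0 && r.pct_over_8ms < 1.0 && r.pct_over_64ms < 0.1;
}

int main(void)
{
    int score = 50;                                  // hypothetical result
    printf("score %dx = %d reads/s + %d writes/s\n",
           score, score * 2000, score * 1000);       // 100,000 reads/s + 50,000 writes/s

    struct act_latency measured = { 3.2, 0.4, 0.01 };  // invented percentages
    printf("meets latency QoS: %s\n", act_passes(measured) ? "yes" : "no");
    return 0;
}
```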
We used the default settings for queue and thread counts and did not manually constrain the benchmark to a single NUMA node, so this test produced a total of 64 threads scheduled across all 72 virtual (36 physical) cores.
The usual runtime for ACT is 24 hours, which makes determining a drive's throughput limit a long process. For fast NVMe SSDs, this is far longer than necessary for drives to reach steady-state. In order to find the maximum rate at which a drive can pass the test, we start at an unsustainably high rate (at least 150x) and incrementally reduce the rate until the test can run for a full hour, then decrease the rate further if necessary to get the drive under the latency limits.
The strict QoS requirements of this test keep a number of these drives from scoring as well as we would expect based on their throughput on our other tests. The biggest disappointment is the Samsung PM1725a that's barely any faster than their newer 983 DCT. The PM1725a has no problem with outliers above the 8ms or 64ms thresholds, but it cannot get 95% of the reads to complete in under 1ms until the workload slows way down. This suggests that it is not as good as newer SSDs at suspending writes in favor of handling a read request. The DapuStor Haishen3 SSDs also underperform relative to comparable drives, which is a surprise given that they offered pretty good QoS on some of the pure read or write tests.
The Memblaze PBlaze5 C916 is the fastest flash SSD in this bunch, but only scores 60% of what the Optane SSD gets. The DERA SSDs that also use 16-channel controllers are the next fastest, though the 8TB D5437 is substantially slower than the 4TB model.
[Graphs: Power Efficiency | Average Power in W]
Since the ACT test runs drives at the throughput where they offer good QoS rather than at their maximum throughput, the power draw from these drives isn't particularly high: the NVMe SSDs range from roughly 4-13 W. The top performers are also generally the most efficient drives on this test. Even though it is slower than expected, the DapuStor Haishen3 H3100 is the second most efficient flash SSD in this round-up, using just over half the power that the slightly faster Intel P4510 requires.
Conclusion
Testing nine new drives at once makes it tricky to summarize the results with just a few conclusions. Each of the four manufacturers represented in this review is in a very different situation requiring a different strategy. In turn, the SSDs we've tested all have their own strengths and weaknesses. Since we're unable to offer real insight on how affordable these drives are to the large business customers they are primarily aimed at, we'll focus more on the tech itself.
The Samsung PM1725a is strictly speaking outdated, having been succeeded by the PM1725b with newer 3D NAND and the PM1735 with PCIe 4.0. But it's still a flagship model from the top SSD manufacturer, and we don't get to test those very often. Despite being two generations old, the PM1725a still holds some advantages over the latest and greatest from lower product segments. It has easily the highest read performance of any single SSD we can get our hands on, including the ability to deliver over 1 million random read IOPS. It's fast enough that most software cannot even come close to fully using its performance potential. The PM1725a also carries a higher write endurance rating than most recent high-end models. However, its actual write performance is not the highest we've seen, and the older NAND and controller mean it sometimes struggles to match the latency of newer drives. The sheer scale of the drive also somewhat limits its usability—not every server can accommodate and make productive use of a drive that is this powerful. On the other hand, the fact that MyDigitalDiscount is currently selling a limited stock of them at consumer SSD pricing means a lot of customers can try out this class of drive for the first time.
The SK Hynix PE6011 targets the most mainstream parts of the datacenter market: high density, manageably low power, 1 DWPD endurance and a performance profile suited for mostly read-oriented workloads. They didn't skimp on capacity, offering up to 7.68TB in a 2.5"/7mm U.2 drive. This kind of drive can't hope to beat the big high-end drives on performance, but it may offer better value. It seems like the biggest challenge for SK Hynix right now is the NAND flash itself. The 72L TLC used in the PE6011 is their first generation of 3D NAND that's remotely competitive, but they have yet to truly get ahead of the rest of the industry despite constantly aiming for the highest layer counts. This 3D TLC seems to be a bit slower than what the other drives are working with, and that cannot be entirely overcome by close cooperation between their NAND, controller and firmware teams. However, they have managed to put together a drive that's decent all around, with no really serious deficiencies. All this drive needs to find success in the market is competitive pricing.
DapuStor's Haishen3 H3000 and H3100 drives have similar aims to the SK hynix PE6011 and PE6031. The Haishen3 drives use a thicker U.2 form factor but cover pretty much the same capacity range and aim for similar performance and power levels. DapuStor's marketing has a particular focus on their drives' power efficiency, and for good reason: the Haishen3 drives are often far more efficient than the competitors we have tested. A lot of this is probably due to the component selection. With Kioxia 96L TLC NAND, these drives are at least a generation ahead of everything else in this review, and the choice of a Marvell controller seems to have worked out well. The performance of the Haishen3 drives is a bit more hit-and-miss than the SK hynix PE6011: usually a little faster for throughput, but with latency and QoS that vary from well ahead of the competition to far behind it depending on the workload. Between the read-oriented H3000 and the H3100 for mixed workloads, these drives have a lot of potential use cases, and the great power efficiency could lead to some real TCO advantages.
DERA's D5437 and D5457 have enough performance to be serious contenders in the high-end enterprise SSD market. Their in-house 16-channel TAI controller might not be quite as powerful or efficient as those from Samsung and Microchip/Microsemi, but it comes close enough - especially taking into account that the controller is a few years old at this point, and about due for a PCIe 4.0 update. It's not hard to see how they've scored major customers at home. Like DapuStor, DERA still has a fairly narrow product line, but they probably have the resources to diversify into other market segments without too much trouble. The big challenges for DERA are probably the relatively non-technical ones: dealing with the geopolitics of being part of the Tsinghua family of companies while trying to sell abroad (especially in the US). They probably won't be able to differentiate their products with the promised benefits of YMTC NAND flash any time soon, so they still need to cultivate a close relationship with one of the incumbent NAND manufacturers (currently Intel).