Original Link: https://www.anandtech.com/show/12435/the-intel-ssd-dc-p4510-ssd-review-part-1-virtual-raid-on-cpu-vroc-scalability
The Intel SSD DC P4510 SSD Review Part 1: Virtual RAID On CPU (VROC) Scalability
by Billy Tallis on February 15, 2018 3:00 PM EST - Posted in
- Storage
- SSDs
- Intel
- RAID
- Enterprise SSDs
- NVMe
- U.2
- Purley
- Skylake-SP
- VROC
Today, Intel is introducing their 64-layer 3D TLC NAND to the enterprise SSD market with the new Intel SSD DC P4510 NVMe drive. They are also finally giving us a chance to test the Virtual RAID On CPU feature introduced with the Skylake-SP and Skylake-X processors last year. Intel has provided four 2TB P4510 SSDs to test against the 8TB model, plus the much sought-after VROC Premium hardware key to unlock the full range of NVMe RAID features on this platform.
The Intel SSD DC P4510 is Intel's first enterprise SSD to use their 64-layer 3D NAND flash memory. The P4510 uses the same second-generation Intel NVMe SSD controller as the other P45xx and P46xx drives introduced last year with 32-layer 3D NAND. With the Optane SSD DC P4800X now covering the highest product segment, Intel's flash-based enterprise NVMe drives are now divided into just two performance and endurance tiers. The P4510 falls into the cheaper tier of drives with much lower write performance and write endurance ratings that are generally between 0.5 drive writes per day (DWPD) and 1 DWPD. The DC P46xx tier has write endurance ratings around 3 DWPD (and the Optane SSD DC P4800X is rated for 30 DWPD).
Intel regards the P45xx tier of products as intended for high-capacity primary storage usage, while the P46xx drives are targeted for use as a cache layer. With their second-generation NVMe controller, Intel is also dividing their enterprise NVMe SSD family on another axis. Unlike the controller used on the P3xxx series, the new controller is small enough to fit on an M.2 card, and can operate within the power limits of that form factor. A month after first announcing the P4500 and P4600, Intel introduced the low-power P4501 as their first M.2 drive with an in-house controller. The P4501 is also available as a 7mm thick U.2 drive, compared to the 15mm thickness of the high-power U.2 drives. The P4510 falls into that high-power category, but the smaller capacities are rated for just 10W active power, which is comparable to the 2.5" version of the P4501.
Intel SSD DC P4510 Specifications
| Capacity | 1 TB | 2 TB | 4 TB | 8 TB |
| Form Factor | 2.5" 15mm U.2 |
| Interface | PCIe 3.1 x4 NVMe 1.2 |
| Memory | Intel 512Gb 64-layer 3D TLC |
| Sequential Read | 2850 MB/s | 3200 MB/s | 3000 MB/s | 3200 MB/s |
| Sequential Write | 1100 MB/s | 2000 MB/s | 2900 MB/s | 3000 MB/s |
| Random Read | 469k IOPS | 624k IOPS | 625.5k IOPS | 620k IOPS |
| Random Write | 72k IOPS | 79k IOPS | 113.5k IOPS | 139.5k IOPS |
| Maximum Power (Active) | 10 W | 10 W | 14 W | 16 W |
| Maximum Power (Idle) | 5 W | 5 W | 5 W | 5 W |
| Write Endurance | 1.1 DWPD | 0.7 DWPD | 0.9 DWPD | 1.0 DWPD |
| Warranty | 5 years |
Compared to the Intel SSD DC P4500, the biggest performance gains the P4510 brings are to write speeds. Random write performance is more than doubled, and sequential write performance is 60-90% higher. With those performance increases and the reduced power consumption, Intel claims the P4510 offers twice the performance per watt on some workloads.
Intel has also tuned the P4510's firmware to offer much better quality of service. A major part of this is that the P4510 is better about prioritizing read operations over flash program or erase operations, leading to an order of magnitude improvement in 99.99th percentile read latency. The P4510 also benefits from the lower flash program latency of the 64L 3D TLC, which is about half that of Intel's 32L 3D NAND.
While today marks the first official announcement of the P4510, the 1TB and 2TB capacities started shipping to some large cloud providers in 2017. New today are the 4TB and 8TB models, and the full line will be more broadly available starting this quarter. With a broad range of capacities, and lower prices and higher performance enabled by the transition to 64L 3D NAND, Intel is expecting the P4510 to be their best-selling SSD this year. For now, the P4510 is only available as a 2.5" U.2 drive, but in the future Intel may introduce other form factors, including EDSFF—the now-standardized version of the Intel Ruler concept.
We have only had a few days to play with the Intel SSD DC P4510 and Intel VROC, so the benchmark results in this review are very limited. This review focuses solely on the performance of the P4510 and the scalability limits of Intel VROC. Follow-up reviews will provide our usual in-depth analysis of single-drive performance and power efficiency as compared against other enterprise NVMe SSDs, compare VROC performance between Windows and Linux, and examine VROC performance with an Optane cache. We will also test VROC with our client SSD test suite and Samsung 960 PRO consumer SSDs.
Test System
For this review, we're using the same system Intel provided for our Optane SSD DC P4800X review. This is a 2U server based on Intel's current Xeon Scalable platform codenamed Purley. The system includes two Xeon Gold 6154 18-core Skylake-SP processors, and 16GB DDR4-2666 DIMMs on all twelve memory channels for a total of 192GB of DRAM.
Each of the two processors provides 48 PCI Express lanes plus a four-lane DMI link. The allocation of these lanes is complicated. Most of the PCIe lanes from CPU1 are dedicated to specific purposes: the x4 DMI plus another x16 link go to the C624 chipset, and there's an x8 link to a connector for an optional SAS controller. This leaves CPU2 providing the PCIe lanes for most of the expansion slots.
Enterprise SSD Test System
| System Model | Intel Server R2208WFTZS |
| CPU | 2x Intel Xeon Gold 6154 (18C, 3.0GHz) |
| Motherboard | Intel S2600WFT |
| Chipset | Intel C624 |
| Memory | 192GB total, Micron DDR4-2666 16GB modules |
| Software | CentOS Linux 7.4, kernel 3.10.0-693.17.1; fio version 3.3 |

Thanks to StarTech for providing a RK2236BKF 22U rack cabinet.
As originally configured, this server was set up to provide as many PCIe slots as possible: 7 x8 slots and one x4 slot. To support U.2 PCIe SSDs in the drive cages at the front of the server, a PCIe switch card was included that uses a Microsemi PM8533 switch to provide eight OCuLink connectors for PCIe x4 cables. This card only has a PCIe x8 uplink, so it is a potential bottleneck for arrays of more than two U.2 SSDs. Without such cards to multiply PCIe lanes, providing PCIe connectivity to all 16 hot-swap bays would consume almost all of the lanes that are routed to PCIe expansion card slots. The Xeon Scalable processors provide a lot of PCIe lanes, but with NVMe SSDs it is still easy to run out.
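To put rough numbers on that bottleneck: a PCIe 3.0 lane signals at 8 GT/s with 128b/130b encoding, so an x8 link offers a little under 8 GB/s of one-way bandwidth before packet overhead, while each P4510 is rated for up to 3.2 GB/s of sequential reads. The back-of-the-envelope estimate below is our own illustration (the 90% protocol-efficiency factor is an assumption), not an Intel figure:

```python
# Back-of-the-envelope PCIe bandwidth estimate (illustrative only).

LANE_RATE_GBPS = 8.0        # PCIe 3.0 signaling rate per lane, Gb/s
ENCODING = 128 / 130        # 128b/130b line encoding efficiency
PROTOCOL_EFFICIENCY = 0.90  # assumed allowance for TLP/framing overhead

def pcie3_usable_gbs(lanes: int) -> float:
    """Approximate usable one-way bandwidth of a PCIe 3.0 link in GB/s."""
    return lanes * LANE_RATE_GBPS * ENCODING * PROTOCOL_EFFICIENCY / 8

PER_DRIVE_READ_GBS = 3.2    # rated sequential read of a 2TB P4510

print(f"x8 uplink : ~{pcie3_usable_gbs(8):.1f} GB/s")                     # ~7.1 GB/s
print(f"x16 uplink: ~{pcie3_usable_gbs(16):.1f} GB/s")                    # ~14.2 GB/s
print(f"4 drives  : ~{4 * PER_DRIVE_READ_GBS:.1f} GB/s aggregate reads")  # 12.8 GB/s
```

Two drives fit comfortably under the roughly 7 GB/s an x8 uplink can deliver, but three or four drives clearly do not, which is why the switch card had to be replaced for the full four-drive tests.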
In order to provide a dedicated x4 link to each of our P4510 SSDs, a few components had to be swapped out. First, one of the riser cards providing three x8 slots was exchanged for a riser with one x16 slot and one x8 slot. The PCIe switch card was exchanged for a PCIe x16 retimer card with four OCuLink ports. A PCIe retimer is essentially a pair of back-to-back PCIe PHYs; its purpose is to reconstruct and retransmit the PCIe signals, ensuring that the signal remains clean enough for full-speed operation over the meter-long path from the CPU to the SSD, passing through six connectors, four circuit boards and 70cm of cable.
When using a riser with a PCIe x16 slot, this server supports configurable lane bifurcation on that slot, so that it can operate as a combination of x4 or x8 links instead. PCIe slot bifurcation support is required to use a single slot to drive four SSDs without an intervening PCIe switch. Bifurcation is generally not supported on consumer platforms, but enthusiast consumer platforms like Skylake-X and AMD's Threadripper are starting to support it—but not on every motherboard. In enthusiast systems, it is expected that consumers will be using M.2 SSDs rather than 2.5" U.2 drives, so several vendors are now selling quad-M.2 adapter cards. In a future review, we will be testing some of those adapters with this system running our client SSD test suite.
The x16 riser and retimer board arrived a few days after the P4510 SSDs, so we have some results from testing four-drive RAID with the PCIe switch card and its x8 uplink bottleneck. Testing of four-drive RAID without the bottleneck is ongoing.
Setting Up VROC
Intel's Virtual RAID on CPU (VROC) feature is a software RAID system that allows for bootable NVMe RAID arrays. There are several restrictions: most notable is the requirement of a hardware dongle to unlock some tiers of VROC functionality. Most Skylake-SP and Skylake-X systems have the small header on the motherboard for installing a VROC key, but the keys themselves have not been available for direct purchase. When first announced, Intel gave the impression that VROC would be available to enthusiast consumers who were willing to pay a few hundred dollars extra. Since then, Intel has retreated to the position of treating VROC as a workstation and server feature that is bundled with OEM-built systems. Intel isn't actively trying to prevent consumers from using VROC on X299 motherboards, but until VROC keys can be purchased at retail, not much can be done.
There are multiple VROC key SKUs, enabling different feature levels. Intel's information about these has been inconsistent. We have a VROC Premium key, which enables everything, including the creation of RAID-5 arrays and the use of non-Intel SSDs. There is also a VROC standard key that does not enable RAID-5, but RAID-0/1/10 are available. More recent documents from Intel also list a VROC Intel SSD Only key, which enables RAID-0/1/10 and RAID-5, but only supports Intel's data center and professional SSD models. Without a VROC hardware key, Intel's RAID solution cannot be used, but some of the underlying features VROC relies on are still available:
Intel Volume Management Device
When 2.5" U.2 NVMe SSDs first started showing up in datacenters, they revealed a widespread immaturity of platform support for features like PCIe hotplug. Proper hotplug and hot-swap of PCIe SSDs required careful coordination between the OS and the motherboard firmware, often through vendor-specific mechanisms. Intel tackled that problem with the Volume Management Device (VMD) feature in Skylake-SP CPUs. Enabling VMD any of the CPU's PCIe ports prevents the motherboard from detecting devices attached to that port at boot time. This shifts all responsibility for device enumeration and management to the OS. With a VMD-aware OS and NVMe driver, SSDs and PCIe switches connected through ports with VMD enabled will be enumerated in an entirely separate PCIe domain, appearing behind a virtual PCIe root complex.
The other major feature of VMD sounds trivial, but is extremely important for datacenters: LED management. 2.5" drives do not have their own status LEDs and instead rely on the hot-swap backplane to implement the indicators necessary to identify activity or a failed device. In the SAS/SCSI world the side channels and protocols for managing this are thoroughly standardized and taken for granted, but the NVMe specifications didn't start addressing this until the addition of the NVMe Management Interface in late 2015. Intel VMD allows systems to use a VMD-specific LED management driver to provide the same standard LED diagnostic blink patterns that a SAS RAID card uses.
PCIe Port Bifurcation
On our test system, PCIe port bifurcation is automatically enabled for x16 slots, but not for x8 slots. No bifurcation settings had to be changed to get VROC working with the four-port PCIe retimer board, but when installing an ASUS Hyper M.2 X16 card with four Samsung 960 PROs, the slot's bifurcation setting had to be manually configured to split the port into four x4 links. This behavior is likely to vary between systems.
Configuring the Array
Once a VROC key is in place and SSDs are installed in VMD-enabled PCIe ports, the motherboard firmware component of VROC can be used. This is a UEFI driver that implements software RAID functionality, plus a configuration utility for creating and managing arrays. On our test system, this firmware utility can be found in the section for UEFI Option ROMs, right below the settings for configuring network booting.
Once an array has been set up, it can be used by any OS that has VROC drivers. All the necessary components are available out of the box from many Linux distributions, so we were able to use CentOS 7.4 without installing any extra software. (Intel provides some extra utilities for management under Linux, but they aren't necessary if you use the UEFI utility to create arrays.) On Windows, Intel's VROC drivers need to be installed, or loaded while running the Windows Setup if the operating system will be booting from a VROC array.
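For reference, arrays can also be created from within Linux rather than the UEFI utility, since VROC volumes are ordinary Linux md arrays using Intel's IMSM metadata format. The sketch below shows the general two-step container-then-volume procedure; the NVMe device names are placeholders, and this is a generic md/IMSM illustration rather than a record of how our arrays were configured:

```python
#!/usr/bin/env python3
"""Sketch: building a VROC-style array from Linux with mdadm (IMSM metadata).

The device names are placeholders; run as root and adjust for the drives
that actually appear behind the VMD domains on a given system.
"""
import subprocess

DEVICES = ["/dev/nvme0n1", "/dev/nvme1n1", "/dev/nvme2n1", "/dev/nvme3n1"]

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Report what the platform (VMD domains, installed VROC key) supports.
run(["mdadm", "--detail-platform"])

# IMSM arrays are created in two steps: first a container that holds the
# Intel metadata...
run(["mdadm", "--create", "/dev/md/imsm0", "--metadata=imsm",
     "--raid-devices=%d" % len(DEVICES)] + DEVICES)

# ...then the actual RAID volume inside that container (RAID-0 here).
run(["mdadm", "--create", "/dev/md/vroc0", "/dev/md/imsm0",
     "--raid-devices=%d" % len(DEVICES), "--level=0"])
```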
Configurations Tested
With four 2TB Intel SSD P4510 drives at our disposal and one 8TB drive to compare against, this review includes results from the following VROC configurations:
- Four 2TB P4510s in RAID-0
- Four 2TB P4510s in RAID-10
- Two 2TB P4510s in RAID-0
- A single 2TB P4510
- A single 8TB P4510
There are also partial results from four-drive RAID-0, RAID-10 and RAID-5 from before the x16 riser and retimer board arrived. These configurations had all four 2TB drives connected through a PCIe switch with a PCIe x8 uplink. Testing of four drives in RAID-5 without this bottleneck is in progress.
Random Read Performance
In order to properly stress a four-drive NVMe RAID array, this test covers queue depths much higher than our client SSD tests: from 1 to 512, using from one to eight worker threads each with queue depths up to 64. Each queue depth is tested for four minutes, and the performance average excludes the first minute. The queue depths are tested back to back with no idle time. The individual read operations are 4kB, and cover the span of the drive or array. Prior to running this test, the drives were preconditioned by writing to the entire drive with random writes, twice over.
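For readers who want to approximate this workload, the sweep can be scripted with fio along the lines of the sketch below. The specific worker/queue-depth pairings and the target device name are illustrative assumptions, not our exact job files:

```python
#!/usr/bin/env python3
"""Illustrative fio driver approximating the random read QD sweep.

The (numjobs, iodepth) pairs are one plausible way to step the total queue
depth from 1 to 512 with at most 8 workers at QD64 each; TARGET is a
placeholder for the drive or md array under test.
"""
import subprocess

TARGET = "/dev/md/vroc0"   # or an individual drive such as /dev/nvme0n1
STEPS = [(1, 1), (1, 2), (1, 4), (2, 4), (2, 8), (4, 8),
         (4, 16), (8, 16), (8, 32), (8, 64)]   # total QD 1 .. 512

for numjobs, iodepth in STEPS:
    total_qd = numjobs * iodepth
    subprocess.run([
        "fio", "--name=randread-qd%d" % total_qd,
        "--filename=" + TARGET,
        "--rw=randread", "--bs=4k", "--direct=1",
        "--ioengine=libaio",
        "--numjobs=%d" % numjobs, "--iodepth=%d" % iodepth,
        # 4 minutes per step, with the first minute excluded from the stats:
        "--time_based", "--ramp_time=60", "--runtime=180",
        "--group_reporting",
    ], check=True)
```

The same structure applies to the random write, sequential read and sequential write tests below, swapping the rw mode and block size accordingly.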
The 2TB and 8TB P4510s have the same peak random read performance. The RAID-0 and RAID-10 configurations both provide about three times the performance of a single drive, and the two-drive RAID-0 provides just under twice the performance of a single drive.
The individual P4510 drives don't saturate until at least QD128. The RAID configurations with a bottleneck from the PCIe x8 switch uplink have hit that limit by the end of the test, but the four-drive configurations without that bottleneck could clearly deliver even higher throughput with more worker threads.
Random Write Performance
As with the random read test, this test covers queue depths from 1 to 512, using from one to eight worker threads each with queue depths up to 64. Each queue depth is tested for four minutes, and the performance average excludes the first minute. The queue depths are tested back to back with no idle time. The individual write operations are 4kB, and cover the span of the drive or array. This test was run immediately after the random read test, so the drives had been preconditioned with two full drive writes of random writes.
The four-drive RAID-0 configuration manages to provide five times the random write throughput of a single 2TB drive, and even the configuration with a PCIe x8 bottleneck is over four times faster than a single drive. The RAID-10 configurations and the two-drive RAID-0 are only slightly faster than the single 8TB drive, which has more than twice the random write throughput of the 2TB model.
Several of the test runs show performance drops in the second half that we did not have time to debug, but the general pattern seems to be that random write performance saturates at relatively low queue depths, around QD16 or QD32.
Sequential Read Performance
The structure of this test is the same as the random read test, except that the reads performed are 128kB and are arranged sequentially. This test covers queue depths from 1 to 512, using from one to eight worker threads each with queue depths up to 64. Each worker thread is reading from a different section of the drive. Each queue depth is tested for four minutes, and the performance average excludes the first minute. The queue depths are tested back to back with no idle time. The individual read operations are 128kB, and cover the span of the drive or array. Prior to running this test, the drives were preconditioned by writing to the entire drive sequentially, twice over.
The sequential read performance results are pretty much as expected. The 2TB and 8TB drives have the same peak throughput. The two-drive RAID-0 is almost as fast as the four-drive array configurations that were working with a PCIe x8 bottleneck, and with that bottleneck removed, performance of the four-drive RAID-0 and RAID-10 increases by 80%.
All but the RAID-5 configuration show a substantial drop in throughput from QD1 to QD2 as competition between threads is introduced, but performance quickly recovers. The individual drives reach full speed at QD16 (eight threads each at QD2). Unsurprisingly, the two-drive configuration saturates at QD32 and the four-drive arrays saturate at QD64.
Sequential Write Performance
The structure of this test is the same as the sequential read test. This test covers queue depths from 1 to 512, using from one to eight worker threads each with queue depths up to 64. Each worker thread is writing to a different section of the drive. Each queue depth is tested for four minutes, and the performance average excludes the first minute. The queue depths are tested back to back with no idle time. The individual write operations are 128kB, and cover the span of the drive or array. This test was run immediately after the sequential read test, so the drives had been preconditioned with sequential writes.
The 8TB P4510 delivers far higher sequential write throughput than the 2TB model. The four-drive RAID-10 configuration requires more than PCIe x8 to beat the 8TB drive. The four-drive RAID-0 is about 3.6 times faster than a single 2TB drive, but only 2.4 times faster than the equivalent capacity 8TB drive.
The sequential write throughput of most configurations saturates with queue depths of just 2-4. The 8TB drive takes a bit longer to reach full speed around QD8. The performance of a four-drive array scales up more slowly when it is subject to a PCIe bottleneck, even before it has reached that upper limit.
Mixed Random Performance
Our test of mixed random reads and writes covers mixes varying from pure reads to pure writes at 10% increments. Each mix is tested for four minutes, with the first minute excluded from the statistics. The test is conducted with eight worker threads and total queue depths of 8, 64 and 512. This test is conducted immediately after the random write test, so the drives have been thoroughly preconditioned with random writes across the entire drive or array.
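The read/write mix sweep can be approximated the same way using fio's rwmixread parameter; again, the target device and the mapping of eight workers onto the three total queue depths are assumptions rather than our exact job files:

```python
#!/usr/bin/env python3
"""Illustrative fio loop for the mixed random read/write sweep."""
import subprocess

TARGET = "/dev/md/vroc0"          # placeholder device or array
for iodepth in (1, 8, 64):        # 8 workers -> total QD 8, 64, 512
    for read_pct in range(100, -1, -10):   # pure reads down to pure writes
        subprocess.run([
            "fio", "--name=mixed-qd%d-r%d" % (8 * iodepth, read_pct),
            "--filename=" + TARGET,
            "--rw=randrw", "--rwmixread=%d" % read_pct,
            "--bs=4k", "--direct=1", "--ioengine=libaio",
            "--numjobs=8", "--iodepth=%d" % iodepth,
            "--time_based", "--ramp_time=60", "--runtime=180",
            "--group_reporting",
        ], check=True)
```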
[Mixed random performance charts: QD 8, QD 64, QD 512]
At the relatively low queue depth of 8, the individual P4510 drives show fairly flat performance across the varied mixes of reads and writes. The RAID configurations help a little bit with the random read performance, but have a much bigger effect on write throughput.
Mixed Sequential Performance
Our test of mixed sequential reads and writes differs from the mixed random I/O test by performing 128kB sequential accesses rather than 4kB accesses at random locations. The highest queue depth tested here is 256. The range of mixes tested is the same, and the timing of the sub-tests are also the same as above. This test was conducted immediately after the sequential write test, so the drives had been preconditioned with sequential writes over their entire span.
[Mixed sequential performance charts: QD 8, QD 64, QD 256]
At QD8, the single 2TB P4510 again has fairly flat performance across the range of mixes, but the 8TB model picks up speed as the proportion of writes increases. The four-drive RAID-0 shows strong increases in performance as the mix becomes more write heavy, and the two-drive RAID-0 shows a similar but smaller effect over most of the test.
At QD64 and QD256, the huge difference in write performance between the four-drive RAID-0 and RAID-10 configurations is apparent. The configurations with a PCIe x8 bottleneck show entirely different behavior, peaking in the middle of the test when they are able to take advantage of the full-duplex nature of PCIe, and slowest at either end of the test when one-way traffic saturates the link. For even balances of reads and writes, the PCIe x8 bottleneck barely affects overall throughput.
Looking Forward
So far, we've only scratched the surface of the Intel SSD DC P4510 and Intel's VROC solution for NVMe RAID. We've been able to verify only the most basic specifications for the drives, but most of the initial results match up with our expectations. The 8TB Intel SSD DC P4510 offers some substantial performance increases over the 2TB model, and the aggregate capacity and performance it enables from a single server is impressive. Once we get the necessary adapter to measure power consumption of U.2 drives with our Quarch XLC Programmable Power Module, we will begin a deeper analysis of the P4510 on a single-drive basis, comparing it against competing enterprise NVMe SSDs. It is already clear that Intel has made significant generational performance improvements, but the purported efficiency improvements will be at least as tantalizing to datacenter customers. The 16W rated power draw of the 8TB P4510 sounds pretty good in comparison to the 20.5W draw of a 4TB P4500, or the 12W of a 2TB P3520.
Intel's Virtual RAID On CPU (VROC) solution for NVMe RAID has very particular requirements, but once all the pieces are in place, using it is straightforward. The experience with VROC on Linux has been fairly smooth so far, but we have not yet attempted to install an operating system to a VROC array. The RAID-0 and RAID-10 performance we've observed shows excellent scalability, reaching 11.6GB/s sequential reads and over 2M random read IOPS from four drives, with no effort on our part to fine-tune the system for maximum performance. At the moment, we are testing RAID-5 performance and the CPU usage of the parity calculations involved in writing to RAID-5 arrays. At the speed of four NVMe drives like the P4510, this overhead is significant and requires multiple CPU cores, but it is unavoidable: there are no hardware RAID solutions for NVMe drives that can keep up with an array of four P4510s.
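As a reminder of what that parity work entails, the toy sketch below computes the XOR parity for a single full RAID-5 stripe and shows how a lost chunk is rebuilt from the survivors. The chunk size is an arbitrary assumption and the pure-Python arithmetic is purely illustrative; Linux md uses hand-optimized vector routines, but it still has to push every written byte through this calculation, which is where the CPU cores go.

```python
"""Toy illustration of the parity work behind a full RAID-5 stripe write."""
import os
from functools import reduce

CHUNK = 128 * 1024                                     # per-drive chunk size (assumed)
data_chunks = [os.urandom(CHUNK) for _ in range(3)]    # 3 data chunks + 1 parity chunk

def xor_parity(chunks):
    """Parity chunk = byte-wise XOR of all chunks in the stripe."""
    ints = [int.from_bytes(c, "little") for c in chunks]
    return reduce(lambda a, b: a ^ b, ints).to_bytes(CHUNK, "little")

parity = xor_parity(data_chunks)

# Losing any one chunk is recoverable by XOR-ing the survivors with the parity:
recovered = xor_parity([parity, data_chunks[1], data_chunks[2]])
assert recovered == data_chunks[0]
```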