Original Link: https://www.anandtech.com/show/11209/intel-optane-ssd-dc-p4800x-review-a-deep-dive-into-3d-xpoint-enterprise-performance
The Intel Optane SSD DC P4800X (375GB) Review: Testing 3D XPoint Performance
by Billy Tallis on April 20, 2017 12:00 PM EST
Intel's new 3D XPoint non-volatile memory technology, which has been on the cards publicly for the last couple of years, is finally hitting the market as the storage medium for Intel's new flagship enterprise storage platform. The Intel Optane SSD DC P4800X is a PCIe SSD using the standard NVMe protocol, but the use of 3D XPoint memory instead of NAND flash memory allows it to deliver great throughput and much lower access latency than any other NVMe SSD.
3D XPoint
The potential significance of 3D XPoint memory is immense. When it was first publicly announced by Intel and Micron in 2015, 3D XPoint memory was a fundamentally different storage technology from the flash memory that dominates the market. It is the first truly new mass-market, high-density solid state storage medium to hit the market since NAND flash itself. It comes at a time when the NAND market is booming like never before, but also at a time when we know that there is a definite end of the line for NAND. The ongoing transition to 3D NAND flash is just a temporary postponement of the fundamental limitations of flash memory. Once NAND can no longer scale in density and cost-per-bit, it will fall to paradigm changes and next-generation memory technologies (one of which will be 3D XPoint) to continue to carry the industry forward. There are many other new memory technologies that may compete alongside flash memory and 3D XPoint in the coming years, but 3D XPoint is the one that's ready to go mainstream now.
In the near term, 3D XPoint is important because it offers a new set of performance tradeoffs entirely unlike NAND; tradeoffs that, for the right applications, can deliver performance far in excess of today's NAND products. By being able to read and write at the bit or word level - and not the 4K+ page level of NAND - 3D XPoint has the potential to deliver excellent performance across a wide range of workloads, but especially in minimally parallel workloads, which are common in the consumer and enterprise spaces.
The drawback is that, as a first-generation technology produced at limited scale, 3D XPoint is more expensive than NAND. It is also less dense, which eases production in this first stage but adds further to the cost. For now, it cannot replicate the sheer capacity and cost effectiveness that have made NAND storage so popular in all market segments. As a result, the first 3D XPoint products are being aimed at speciality and high-margin markets: enterprise performance, consumer caching, etc. Future products promised by Intel should add non-volatile DIMMs to the mix and then later on, if everything goes to plan, a potential wholesale replacement for NAND flash (or at least a strong competitor).
The Intel Optane SSD DC P4800X
The new storage drive, and the focus of today's review, is the Intel Optane SSD DC P4800X. It uses a new NVMe controller Intel developed specifically for use with 3D XPoint memory. Where Intel's enterprise NVMe SSDs like the P3700 use a controller with 18 channels for interfacing to their flash memory, the Optane SSD's controller has only 7 channels. In order to achieve at least parity on peak performance, each of those channels has to provide much higher throughput than on a flash SSD, and it shows that each 3D XPoint memory die is delivering much higher performance than a die of flash memory.
The first capacity of the Optane SSD DC P4800X to ship, and the model we've tested here, offers a usable capacity of 375GB from a total of 28 3D XPoint memory dies (four per channel) for a raw capacity of 448GB. 3D XPoint memory has better endurance than NAND flash, but not enough to get away without wear leveling. The fine-grained accessibility of 3D XPoint memory gets rid of a lot of the wear leveling and write amplification headaches caused by flash pages and erase blocks being larger than the sector sizes exposed by the drives, but the drive still needs some spare area plus storage for error correction overhead, metadata for tracking the mapping between logical blocks and physical addresses, and potential replacement of bad sectors, similar to normal SSDs.
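The capacity arithmetic above can be checked with a short sketch. It uses only the figures quoted in this review (7 channels, four dies per channel, 128Gb per die, 375GB usable) and assumes decimal units with 8Gb per GB. How the reserved space is divided between spare area, error correction, and metadata is not something Intel has disclosed, so the sketch only reports the total.

```python
# Figures quoted in the review; the unit convention (8 Gb = 1 GB) is an assumption.
CHANNELS = 7
DIES_PER_CHANNEL = 4
DIE_CAPACITY_GBIT = 128
USABLE_GB = 375


def raw_capacity_gb(channels, dies_per_channel, die_gbit):
    """Raw capacity in GB for a given die count and per-die capacity."""
    return channels * dies_per_channel * die_gbit / 8


raw_gb = raw_capacity_gb(CHANNELS, DIES_PER_CHANNEL, DIE_CAPACITY_GBIT)
reserved_gb = raw_gb - USABLE_GB          # spare area + ECC + metadata, combined
reserved_fraction = reserved_gb / raw_gb  # share of raw capacity not exposed

print(f"raw capacity: {raw_gb:.0f} GB")
print(f"reserved:     {reserved_gb:.0f} GB ({reserved_fraction:.1%} of raw)")
```

Under these assumptions, roughly a sixth of the raw 3D XPoint capacity is held back from the user.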
As with most NVMe SSDs, the Optane SSD DC P4800X supports a configurable sector size. Out of the box it emulates 512B sectors for the sake of compatibility, but using the NVMe FORMAT command it can be switched to emulate 4kB sectors. The larger sector size reduces the amount of metadata the SSD controller has to juggle, so it usually allows for slightly higher performance. The NVMe FORMAT command is also the mechanism for triggering a secure erase of the entire drive, and for flash SSDs the format usually consists of little more than issuing block erase commands to the whole drive. 3D XPoint memory does not have large multi-megabyte erase blocks, so a low-level format of the Optane SSD needs to directly write to the entire drive, which takes about as long as filling it sequentially. Thus, while a 2.4TB flash SSD can perform a low-level format in just over 13 seconds, the 375GB Optane SSD DC P4800X takes six minutes and 47 seconds. This is long enough that unsuspecting software tools or SSD reviewers will give up and assume that the drive has locked up.
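To put the format times in perspective, the sketch below converts them into an implied data rate. It is only a rough illustration: we do not know whether the format writes just the 375GB usable area or the full 448GB of raw media, so both are shown, and the flash figure uses 13 seconds even though the measured time was slightly longer.

```python
def to_seconds(minutes, seconds=0):
    """Convert a minutes/seconds duration to seconds."""
    return minutes * 60 + seconds


def implied_rate_gb_per_s(capacity_gb, duration_s):
    """Data rate needed to touch capacity_gb within duration_s."""
    return capacity_gb / duration_s


optane_format_s = to_seconds(6, 47)  # 375GB Optane SSD DC P4800X
flash_format_s = 13                  # 2.4TB flash SSD, "just over 13 seconds"

optane_usable_rate = implied_rate_gb_per_s(375, optane_format_s)
optane_raw_rate = implied_rate_gb_per_s(448, optane_format_s)
flash_rate = implied_rate_gb_per_s(2400, flash_format_s)

print(f"Optane, usable area: {optane_usable_rate:.2f} GB/s")
print(f"Optane, raw media:   {optane_raw_rate:.2f} GB/s")
print(f"Flash, 2.4TB:        {flash_rate:.0f} GB/s")
```

The Optane figures are in the range of a plausible sequential write speed, which fits a format that writes every sector. The flash figure is far beyond what a PCIe 3.0 x4 link could carry, which is consistent with the format being a series of block erase commands rather than a transfer of data.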
Intel Optane SSD DC P4800X Specifications
Capacity | 375 GB | 750 GB | 1.5 TB
Form Factor | PCIe HHHL or 2.5" 15mm U.2
Interface | PCIe 3.0 x4 NVMe
Controller | Intel unnamed
Memory | 128Gb 20nm Intel 3D XPoint
Typical Latency (R/W) | <10µs
Random Read (4 kB) IOPS (QD16) | 550,000 | TBA | TBA
Random Read 99.999% Latency (QD1) | 60µs | TBA | TBA
Random Read 99.999% Latency (QD16) | 150µs | TBA | TBA
Random Write (4 kB) IOPS (QD16) | 500,000 | TBA | TBA
Random Write 99.999% Latency (QD1) | 100µs | TBA | TBA
Random Write 99.999% Latency (QD16) | 200µs | TBA | TBA
Mixed 70/30 (4kB) Random IOPS (QD16) | 500,000 | TBA | TBA
Endurance | 30 DWPD
Warranty | 5 years (3 years during early limited release)
MSRP | $1520 | TBA | TBA
Release Date (HHHL) | March 19 | Q2 2017 | 2H 2017
Release Date (U.2) | Q2 2017 | 2H 2017 | 2H 2017
So far, Intel has only started shipping the 375GB Optane SSD DC P4800X to select customers, and they have not released detailed specifications for the larger capacities that will ship later this year.
It is worth noting that the performance specifications for the P4800X, as provided in the product specification sheets, cover a different set of metrics than Intel usually reports for their enterprise SSDs. Sequential performance is not mentioned at all, but the product brief has quite a bit to say about latency: average latency for QD1 reads and writes, and 99.999th percentile latency for both reads and writes at QD1 and QD16. The fact that Intel is publishing a five-nines QoS metric at all suggests that they plan to set a new standard for performance consistency.
The throughput claims are also remarkable: half a million IOPS or more for reads, writes and a 70/30 read/write mix. There are already drives on the market that can deliver more than 550k random read IOPS, but those SSDs are far larger than 375GB and they require very high queue depths to hit 550k IOPS. There are even a few multi-TB drives that can beat 500k random write IOPS, but they can't sustain that performance indefinitely. The Optane SSD DC P4800X is promising an unprecedented level of storage performance both in absolute terms and relative to its capacity, so it is interesting to see where Intel is going to lay down its line in the sand.
The P4800X will not really occupy the same niche as the multi-TB monsters that offer comparable throughput. With limited capacity but the highest level of performance, this Optane SSD most closely fits the role of SLC NAND based SSDs. SLC has disappeared from the SSD market as virtually all customers preferred to sacrifice a little bit of performance to double their capacity by using MLC NAND. One of the last high-performance SLC SSDs was the Micron P320h, a PCIe SSD from 2012 that slightly pre-dated NVMe and used 34nm SLC NAND flash. Anyone still using a P320h for its consistent low latency performance will be very interested in the P4800X. Outside of that niche, the Optane SSD will obviously be desirable for its raw throughput, but the low capacity may be problematic for some use cases.
One of the unique and most notable performance advantages of the Optane SSD DC P4800X is that it does not require extremely high queue depths to reach full throughput. Enterprise customers have long had to design their systems around the fact that getting full performance out of the fastest PCIe SSDs requires loading them down with queue depths of 128 or higher, sometimes requiring applications to use dozens of threads for I/O. In the client space achieving such queue depths is outright impossible, and in the enterprise space it doesn't happen for free. The P4800X's high performance at low queue depths makes it a much easier drive to get great real-world performance out of.
Intel originally introduced 3D XPoint memory as having far higher write endurance than NAND flash, on the order of 1000x higher. The Optane SSD DC P4800X is rated for 30 drive writes per day (DWPD). The models shipping during this early limited availability period carry that rating for only three years, rather than the five years Intel expects to offer on the full retail models. Intel says they're being extremely conservative with a new and unproven technology, but doing the math shows that 30 DWPD doesn't provide any endurance advantage over the most highly over-provisioned flash-based enterprise SSDs. In terms of total petabytes written, the P4800X only has four-fifths the endurance of the SLC-based Micron P320h. Even allowing for Intel's original comparisons possibly having been relative to lower-endurance contemporary MLC or TLC flash, it seems like this first generation of 3D XPoint memory is not as durable as originally planned. The headline number of 30 DWPD is aimed at alleviating that concern, but for Intel to match its original intentions the second and third generation parts will have to be a step up, and we look forward to testing them.
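For readers who want to redo the endurance math, a DWPD rating converts to total bytes written as capacity × DWPD × days. The sketch below applies that to the 375GB model for both the three-year and five-year periods; the 365-day year and decimal petabytes are our assumptions.

```python
def total_writes_pb(capacity_gb, dwpd, years, days_per_year=365):
    """Total rated writes in petabytes (decimal) for a DWPD rating."""
    return capacity_gb * dwpd * days_per_year * years / 1e6


CAPACITY_GB = 375
DWPD = 30

early_release_pb = total_writes_pb(CAPACITY_GB, DWPD, years=3)
full_retail_pb = total_writes_pb(CAPACITY_GB, DWPD, years=5)

print(f"3-year rating: {early_release_pb:.1f} PB written")
print(f"5-year rating: {full_retail_pb:.1f} PB written")
```

The five-year figure works out to a little over 20PB, and the three-year early release rating to a little over 12PB.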
Pricing
The MSRP for the 375GB P4800X is $1520, though it will be quite some time before it can readily be ordered from major online retailers. At slightly more than $4/GB, the P4800X will be almost twice as expensive per GB as Intel's next most pricey SSD, the P3608 (which is really two drives in one plus a PCIe switch). Compared to Intel's fastest single SSD (the P3700), the P4800X will be more than three times as expensive per GB. In the broader SSD market, $4/GB is not completely unprecedented, but most companies selling drives in this price range don't even pretend to have a retail price.
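The per-GB comparison can be reproduced from the MSRP alone. The sketch below computes the P4800X's price per GB and the per-GB prices implied by the "almost twice" and "more than three times" statements above; the actual prices of the P3608 and P3700 are not quoted here, so those two values are bounds rather than measurements.

```python
MSRP_USD = 1520
CAPACITY_GB = 375

price_per_gb = MSRP_USD / CAPACITY_GB

# "Almost twice" the P3608: the P3608 costs somewhat more than half as much per GB.
p3608_floor_per_gb = price_per_gb / 2
# "More than three times" the P3700: the P3700 costs less than a third as much per GB.
p3700_ceiling_per_gb = price_per_gb / 3

print(f"P4800X: ${price_per_gb:.2f}/GB")
print(f"P3608:  a little above ${p3608_floor_per_gb:.2f}/GB")
print(f"P3700:  below ${p3700_ceiling_per_gb:.2f}/GB")
```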
This Review
For this review of the Intel Optane SSD DC P4800X, first, we are going to take a deeper dive into what 3D XPoint actually is. Then we go through our testing suite for enterprise drives, testing Intel's claims on performance.
It is worth noting that there is no such thing as a general-purpose enterprise SSD. Enterprise storage workloads are far more varied than client workloads and it is impossible to make general statements about whether random or sequential performance is more important, what kind of mix of reads and writes to expect, or what queue depth is appropriate to test with. Real-world application benchmarks are difficult to construct and typically end up being far more narrowly applicable than we would hope. Our strategy for this review is to provide a very broad range of synthetic tests with the knowledge that not all results will be relevant to all use cases. Enterprise customers must know and understand their own workload. Since this is our first time testing anything with 3D XPoint memory, this review includes some new benchmarks that would probably not be applicable to a flash SSD review, making for some interesting numbers.
3D XPoint Refresher
Intel's 3D XPoint memory technology is fundamentally very different from NAND flash. Intel has not clarified any more low-level details since their initial joint announcement with Micron of this technology, so our analysis from 2015 is still largely relevant. The industry consensus is that 3D XPoint is something along the lines of a phase change memory or conductive bridging resistive RAM, but we won't know for sure until third parties put 3D XPoint memory under an electron microscope.
Even without knowing the precise details, the high-level structure of 3D XPoint confers some significant advantages and disadvantages relative to NAND flash or DRAM. 3D XPoint can be read or written at the bit or word level, which greatly simplifies random access and wear leveling as compared to the multi-kB pages that NAND flash uses for read or program operations and the multi-MB blocks used for erase operations. Where DRAM requires a transistor for each memory cell, 3D XPoint isolates cells from each other by stacking them each in series with a diode-like selector. This frees up 3D XPoint to use a multi-layer structure, though not one that is as easy to manufacture as 3D NAND flash. This initial iteration of 3D XPoint uses just two layers and provides a per-die capacity of 128Gb, a step or two behind NAND flash but far ahead of the density of DRAM. 3D XPoint is currently storing just one bit per memory cell while today's NAND flash is mostly storing two or three bits per cell. Intel has indicated that the technology they are using, with sufficient R&D, can support more bits per cell to help raise density.
The general idea of a resistive memory cell paired with a selector and built at the intersections of word and bit lines is not unique to 3D XPoint memory. The term "crosspoint" has been used to describe several memory technologies with similar high-level architectures but different implementation details. As one Intel employee has explained, it is relatively easy to discover a material that exhibits hysteresis and thus has the potential to be used as a memory cell. The hard part is designing a memory cell and selector that are fast, durable, and manufacturable at scale. The greatest value in Intel's 3D XPoint technology is not the high-level design but the specific materials and manufacturing methods that make it a practical invention. It has been noted by some analysts that the turning point for technologies such as 3D XPoint may very well be the development of the selector itself, which is believed to be a Schottky diode or an ovonic selector.
In addition to the advantages that any resistive memory built on a crosspoint array can expect, Intel's 3D XPoint memory is supposed to offer substantially higher write endurance than NAND flash, and much lower read and write times. Intel has only quantified the low-level performance of 3D XPoint memory with rough order of magnitude comparisons against DRAM and NAND flash in general, so this test of the Optane SSD DC P4800X is the first chance to get some precise data. Unfortunately, we're only indirectly observing the capabilities of 3D XPoint, because the Optane SSD is still a PCIe SSD with a controller translating the block-oriented NVMe protocol and providing wear leveling.
The only other Optane product Intel has announced so far is another PCIe SSD, but on an entirely different scale: the Optane Memory product for consumers uses just one or two 3D XPoint chips and is intended to serve as a 32GB cache device accelerating access to a mechanical hard drive or slower SATA SSD. Next year Intel will start talking about putting 3D XPoint on DIMMs, and by then if not sooner we should have more low-level information about 3D XPoint technology.
Test Configurations
While the Intel Optane SSD DC P4800X is technically launching today, 3D XPoint memory is still in short supply. Only the 375GB add-in card model has been shipped, and only as part of an early limited release program. The U.2 version of the 375GB model and the add-in card 750GB model are planned for a Q2 release, and the U.2 750GB model and the 1.5TB model are expected in the second half of 2017. Intel's biggest enterprise customers, such as the Super Seven, have had access to Optane devices throughout the development process, but broad retail availability is still a little ways off.
Citing the current limited supply, Intel has taken a different approach to review sampling for this product. Their general desire for secrecy regarding the low-level details of 3D XPoint has also likely been a factor. Instead of shipping us the Optane SSD DC P4800X to test on our own system, as is normally the case with our storage testing, this time around Intel has only provided us with remote access to a DC P4800X system housed in their data center. Their Non-Volatile Memory Solutions Group maintains a pool of servers to provide partners and customers with access to the latest storage technologies and their software partners have been using these systems for months to develop and optimize applications to take advantage of Optane SSDs.
Intel provisioned one of these servers for our exclusive use during the testing period, and equipped it with a 375GB Optane SSD DC P4800X and an 800GB SSD DC P3700 for comparison. The P3700 was the U.2 version of the drive and was connected through a PLX PEX 9733 PCIe switch. The Optane SSD under test was initially going to be a U.2 version connected to the same backplane, but Intel found that the PCIe switch was introducing some inconsistency in the access latency on the order of a microsecond or two, which is a problem when trying to benchmark a drive with ~8µs best case latency. Intel swapped out the U.2 Optane SSD for an add-in card version that uses PCIe lanes direct from the processor, but the P3700 was still potentially subject to whatever problems the PCIe switch may have caused. Clearly, there's some work to be done to ensure the ecosystem is ready to take full advantage of the performance promised by Optane SSDs, but debugging such issues is beyond the scope of this review.
Intel NSG Marketing Test Server
CPU | 2x Intel Xeon E5 2699 v4
Motherboard | Intel S2600WTR2
Chipset | Intel C612
Memory | 256GB total, Kingston DDR4-2133 CL11 16GB modules
OS | Ubuntu Linux 16.10, kernel 4.8.0-22
The system was running a clean installation of Ubuntu 16.10, with no Intel or Optane-specific software or drivers installed, and the rest of the system configuration was as expected. We had full administrative access to tweak the software to our liking, but chose to leave it mostly in its default state.
Our benchmarks are a variety of synthetic workloads generated and measured using fio version 2.19. There are quite a few operating system and fio options that can be tuned, but we generally left them alone: for example, the NVMe driver wasn't manually switched to polling mode, CPU affinity was not manually set, and nothing was tweaked about power management or CPU turbo clock speeds. There is work underway to switch fio over to nanosecond-precision time measurement, but it has not reached a usable state yet. Our tests only record latencies in whole-microsecond increments, so mean latencies that report fractional microseconds are just weighted averages reflecting, e.g., how many operations were closer to 8µs than 9µs.
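As an illustration of how whole-microsecond samples produce a fractional mean, the sketch below works backwards from a mean of 8.9µs under the simplifying assumption that every operation landed in one of the two neighboring buckets. Real distributions also have a tail, so this is not a reconstruction of our actual data, and the operation counts are made up for the example.

```python
import math


def split_between_neighbors(mean_us):
    """Share of operations in the two whole-microsecond buckets around mean_us,
    assuming no samples fall anywhere else."""
    low = math.floor(mean_us)
    share_high = mean_us - low
    return {low: 1 - share_high, low + 1: share_high}


def mean_latency(buckets):
    """Weighted mean of a {latency_us: count_or_weight} histogram."""
    total = sum(buckets.values())
    return sum(latency * weight for latency, weight in buckets.items()) / total


shares = split_between_neighbors(8.9)
print({latency: round(share, 2) for latency, share in shares.items()})

# Hypothetical counts: 100 operations at 8µs and 900 at 9µs.
print(f"{mean_latency({8: 100, 9: 900}):.1f} µs")
```

Under that assumption, a mean of 8.9µs corresponds to about nine operations in ten completing in 9µs and the rest in 8µs.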
All tests were run directly on the SSD with no intervening filesystem. Real-world applications will almost always be accessing the drive through a filesystem, but will also be benefiting from the operating system's cache in main RAM, which is bypassed with this testing methodology.
To provide an extra point of comparison, we also tested the Micron 9100 MAX 2.4TB on one of our systems, using a Xeon E3 1240 v5 processor. In order to not unfairly disadvantage the Micron 9100, most of the tests were limited to use at most 4 threads. Our test system was running the same Linux kernel as the Intel NSG marketing test server and used a comparable configuration with the Micron 9100 connected directly to the CPU's PCIe lanes rather than through the PCH.
AnandTech Enterprise SSD Testbed
CPU | Intel Xeon E3 1240 v5
Motherboard | ASRock Fatal1ty E3V5 Performance Gaming/OC
Chipset | Intel C232
Memory | 4x 8GB G.SKILL Ripjaws DDR4-2400 CL15
OS | Ubuntu Linux 16.10, kernel 4.8.0-22
Because this was not a hands-on test of the Optane SSD on our own equipment, we were unable to conduct any power consumption measurements. Due to the limited time available for testing, we were unable to make any systematic test of write endurance or the impact of extra overprovisioning on performance. We hope to have the opportunity to conduct a full hands-on review later in the year to address these topics.
Due to time constraints, we were unable to cover Intel's new Memory Drive Technology software. This is an optional software add-on that can be purchased with the Optane SSD. The Memory Drive Technology software is a minimal virtualization system that lets software treat the Optane SSD as if it were RAM. The hypervisor will present to the guest OS a pool of memory equal to the amount of available DRAM plus up to 320GB of the Optane SSD's 375GB capacity. The hypervisor manages the placement of data to automatically cache hot data in DRAM, so applications and the guest OS cannot explicitly address or allocate Optane storage. We may get a chance to look at this in the future, as it offers an interesting look at the ways multi-tiered storage will be affecting the enterprise market over the next few years.
Checking Intel's Numbers
The product brief for the Optane SSD DC P4800X provides a limited set of performance specifications, entirely omitting any standards for sequential throughput. Some latency and throughput targets are provided for 4kB random reads, writes, and a 70/30 mix of reads and writes.
This section has our results for how the Optane SSD measures up to Intel's advertised specifications and how the flash SSDs fare on the same tests. The rest of this review provides deeper analysis of how these drives perform across a range of queue depths, transfer sizes, and read/write mixes.
4kB Random Read at a Queue Depth of 1 (QD1)
Drive | Throughput (MB/s) | Throughput (IOPS) | Mean Latency (µs) | Median Latency (µs) | 99th Percentile Latency (µs) | 99.999th Percentile Latency (µs)
Intel Optane SSD DC P4800X 375GB | 413.0 | 108.3k | 8.9 | 9 | 10 | 37
Intel SSD DC P3700 800GB | 48.7 | 12.8k | 77.9 | 76 | 96 | 2768
Micron 9100 MAX 2.4TB | 35.3 | 9.2k | 107.7 | 104 | 117 | 306
Intel's queue depth 1 specifications are expressed in terms of latency, and a throughput specification at QD1 would be redundant. Intel specifies a "typical" latency of less than 10µs, and most QD1 random reads on the Optane SSD take 8 or 9µs; even the 99th percentile latency is still 10µs.
The 99.999th percentile target is less than 60µs, which the Optane SSD beats by a wide margin. Overall, the Optane SSD passes with ease. The flash SSDs are 8-12x slower on average, and the 99.999th percentile latency of the Intel P3700 is far worse, at around 75x slower.
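The reason a QD1 throughput specification would be redundant is that, with a fixed number of requests in flight, throughput is roughly queue depth divided by mean latency (Little's law). The sketch below applies that to the QD1 read results from the table above. It is an approximation: it ignores any time the host spends between one completion and the next submission, which is our guess for why the measured IOPS come in slightly below the prediction.

```python
# (mean latency in µs, measured IOPS in thousands) from the QD1 random read table.
QD1_RANDOM_READ = {
    "Intel Optane SSD DC P4800X 375GB": (8.9, 108.3),
    "Intel SSD DC P3700 800GB": (77.9, 12.8),
    "Micron 9100 MAX 2.4TB": (107.7, 9.2),
}


def predicted_kiops(queue_depth, mean_latency_us):
    """Little's law: requests in flight / time per request, in thousands of IOPS."""
    return queue_depth / mean_latency_us * 1000


for drive, (latency_us, measured_kiops) in QD1_RANDOM_READ.items():
    predicted = predicted_kiops(1, latency_us)
    print(f"{drive}: predicted {predicted:.1f}k, measured {measured_kiops}k")
```

For all three drives the prediction lands within a few percent of the measured figure.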
4kB Random Read at a Queue Depth of 16 (QD16)
Drive | Throughput (MB/s) | Throughput (IOPS) | Mean Latency (µs) | Median Latency (µs) | 99th Percentile Latency (µs) | 99.999th Percentile Latency (µs)
Intel Optane SSD DC P4800X 375GB | 2231.0 | 584.8k | 25.5 | 25 | 41 | 81
Intel SSD DC P3700 800GB | 637.9 | 167.2k | 93.9 | 91 | 163 | 2320
Micron 9100 MAX 2.4TB | 517.5 | 135.7k | 116.2 | 114 | 205 | 1560
The Optane SSD's QD16 random read result is 584.8k IOPS, which is above the official specification of 550k IOPS by a few percent. The 99.999th percentile latency is 81µs, well under the target of less than 150µs. The flash SSDs are 3-5x slower on most metrics, but 20-30x slower at the 99.999th percentile for latency.
4kB Random Write at a Queue Depth of 1 (QD1)
Drive | Throughput (MB/s) | Throughput (IOPS) | Mean Latency (µs) | Median Latency (µs) | 99th Percentile Latency (µs) | 99.999th Percentile Latency (µs)
Intel Optane SSD DC P4800X 375GB | 360.6 | 94.5k | 8.9 | 9 | 10 | 64
Intel SSD DC P3700 800GB | 350.6 | 91.9k | 9.2 | 9 | 18 | 81
Micron 9100 MAX 2.4TB | 160.9 | 42.2k | 22.2 | 22 | 24 | 76
Intel's QD1 random write specification keeps the same typical latency of under 10µs, while the 99.999th percentile latency target is relaxed from 60µs to 100µs. In our results, the QD1 random write throughput (360.6 MB/s) of the Optane SSD is a bit lower than the QD1 random read throughput (413.0 MB/s), but the latency is roughly the same (8.9µs mean, 10µs at the 99th percentile).
However, it is worth noting that the Optane SSD only manages a passing score when the application uses asynchronous I/O APIs. Using simple synchronous write() system calls pushes the average latency up to 11-12µs.
Thanks to their capacitor-backed DRAM caches, the flash SSDs also handle QD1 random writes very well. The Intel P3700 manages to keep latency mostly below 10µs, and all three drives have 99.999th percentile latency below Intel's 100µs standard for the Optane SSD.
4kB Random Write at a Queue Depth of 16 (QD16)
Drive | Throughput (MB/s) | Throughput (IOPS) | Mean Latency (µs) | Median Latency (µs) | 99th Percentile Latency (µs) | 99.999th Percentile Latency (µs)
Intel Optane SSD DC P4800X 375GB | 2122.5 | 556.4k | 27.0 | 23 | 65 | 147
Intel SSD DC P3700 800GB | 446.3 | 117.0k | 134.8 | 43 | 1336 | 9536
Micron 9100 MAX 2.4TB | 1144.4 | 300.0k | 51.6 | 34 | 620 | 3504
The Optane SSD DC P4800X is specified for 500k random write IOPS using four threads to provide a total queue depth of 16. In our tests, the Optane SSD scored 556.4k IOPS, exceeding the specification by more than 11%. This equates to a random write throughput of more than 2GB/s.
The flash SSDs are more dependent on the parallelism that comes with higher capacities, so smaller flash drives tend to be slower. Hence in this case the 2.4TB Micron 9100 fares much better than the 800GB Intel P3700. The Micron 9100 hits its own specification right on the nose with 300k IOPS and the Intel P3700 comfortably exceeds its own 90k IOPS specification, although it remains the slowest of the three by far. The Optane SSD stays well below its 200µs limit for 99.999th percentile latency by scoring 147µs, while the flash SSDs have outliers of several milliseconds. Even at the 99th percentile the flash SSDs are 10-20x slower than Optane.
4kB Random Mixed 70/30 Read/Write Queue Depth 16
Drive | Throughput (MB/s) | Throughput (IOPS) | Mean Latency (µs) | Median Latency (µs) | 99th Percentile Latency (µs) | 99.999th Percentile Latency (µs)
Intel Optane SSD DC P4800X 375GB | 1929.7 | 505.9k | 29.7 | 28 | 65 | 107
Intel SSD DC P3700 800GB | 519.9 | 136.3k | 115.5 | 79 | 1672 | 5536
Micron 9100 MAX 2.4TB | 518.0 | 135.8k | 116.0 | 105 | 1112 | 3152
On a 70/30 read/write mix, the Optane SSD DC P4800X scores 505.9k IOPS, which beats the specification of 500k IOPS by 1%. Both of the flash SSDs deliver roughly the same throughput, a little over a quarter of the speed of the Optane SSD. Intel doesn't provide a latency specification for this workload, but the measurements unsurprisingly fall in between the random read and random write results. While low-end consumer SSDs sometimes perform dramatically worse on mixed workloads than on pure read or write workloads, none of these enterprise-class drives have that problem.
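The QD16 comparisons against Intel's specifications can be summarized with a short sketch. The spec and measured IOPS are taken from this section; the conversion to GB/s assumes 4096-byte transfers and decimal gigabytes, which is not necessarily the convention behind the MB/s column in the tables, so the throughput figure will not match that column exactly.

```python
# (specified, measured) in thousands of IOPS at QD16, from the sections above.
SPEC_VS_MEASURED_KIOPS = {
    "random read": (550.0, 584.8),
    "random write": (500.0, 556.4),
    "70/30 mixed": (500.0, 505.9),
}


def margin_over_spec(spec_kiops, measured_kiops):
    """Fraction by which the measured result exceeds the specification."""
    return measured_kiops / spec_kiops - 1


def throughput_gb_per_s(kiops, block_bytes=4096):
    """Throughput in decimal GB/s for a given IOPS figure and transfer size."""
    return kiops * 1000 * block_bytes / 1e9


for workload, (spec, measured) in SPEC_VS_MEASURED_KIOPS.items():
    print(f"{workload}: {margin_over_spec(spec, measured):+.1%} vs. spec, "
          f"{throughput_gb_per_s(measured):.2f} GB/s")
```

All three workloads clear their specification, with random writes showing the largest margin and the mixed workload the smallest.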
Random Read
Random read speed is the most difficult performance metric for flash-based SSDs to improve on. There is very limited opportunity for a drive to do useful prefetching or caching, and parallelism from multiple dies and channels can only help at higher queue depths. The NVMe protocol reduces overhead slightly, but even a high-end enterprise PCIe SSD can struggle to offer random read throughput that would saturate a SATA link.
Real-world random reads are often blocking operations for an application, such as when traversing the filesystem to look up which logical blocks store the contents of a file. Opening even a non-fragmented file can require the OS to perform a chain of several random reads, and since each is dependent on the result of the last, they cannot be queued.
Our first test of random read performance looks at the dependence on transfer size. Most SSDs focus on 4kB random access as that is the most common page size for virtual memory systems and it is a common filesystem block size. Maximizing 4kB performance has gotten more difficult as NAND flash has moved to page sizes that are larger than 4kB, and some SSD vendors have started including 8kB random access specifications. It is worth noting that 3D XPoint memory, from a fundamental standpoint, does not impose any inherent block size restrictions on the Optane SSD, but for compatibility purposes the P4800X by default exposes a 512B sector size.
Queue Depth 1
For our test, each transfer size was tested for four minutes and the statistics exclude the first minute. The drives were preconditioned to steady state by filling them with 4kB random writes twice over.
The Optane SSD starts off with about eight times the throughput of the other drives for small random reads. As the transfer sizes grow past 16kB the Optane SSD's performance starts to level off and the flash SSDs start to catch up, with the Micron 9100 overtaking the Intel P3700. At 1MB transfer size the Optane SSD is only providing about 50% higher throughput than the Micron 9100.
Queue Depth >1
Next, we consider 4kB random read performance at queue depths greater than one. A single-threaded process is not capable of saturating the Optane SSD DC P4800X with random reads so this test is conducted with up to four threads. The queue depths of each thread are adjusted so that the queue depth seen by the SSD varies from 1 to 64, with every single queue depth from 1 through 16, then 18, 20, and multiples of four up to 64 (so 24, 28, 32... to 64). The timing is the same as for the other tests: four minutes for each tested queue depth, with the first minute excluded from the statistics.
Looking just at the range of throughputs and latencies achieved, it is clear that the Optane SSD DC P4800X is in a different league entirely from the flash SSDs. The Optane SSD saturates part way through the test with a throughput about 30% higher than what the Micron 9100 can deliver even at QD64, and at the same time its 99.999th percentile latency is half of the Micron 9100's median latency.
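The queue depth schedule described above can be written out as a short sketch. The list of tested queue depths follows the description directly; the way a total queue depth is divided among threads is an assumption made for illustration (as even a split as possible across at most four threads), since the exact per-thread settings are not listed here.

```python
def tested_queue_depths():
    """Total queue depths seen by the SSD: 1-16, then 18 and 20, then every
    fourth value from 24 up to 64."""
    return list(range(1, 17)) + [18, 20] + list(range(24, 65, 4))


def split_across_threads(total_qd, max_threads=4):
    """Hypothetical helper: divide a total queue depth as evenly as possible
    across up to max_threads threads, each with a depth of at least one."""
    threads = min(max_threads, total_qd)
    base, extra = divmod(total_qd, threads)
    return [base + 1 if i < extra else base for i in range(threads)]


depths = tested_queue_depths()
minutes_per_depth = 4  # four minutes per queue depth, first minute discarded
print(f"{len(depths)} queue depths, {len(depths) * minutes_per_depth} minutes per drive")
for qd in (1, 3, 18, 64):
    print(qd, split_across_threads(qd))
```

With four minutes spent at each of the tested queue depths, a single pass of this test takes close to two hours per drive.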
Between the two flash SSDs, the Intel P3700 has better performance on average through most of the test, but its maximum achieved throughput is slightly lower than the Micron 9100's peak and the 9100 offers lower latency at the high end. The Micron 9100 also has much better 99.999th percentile latency across almost the entire range of queue depths.
In absolute terms, the Optane SSD's performance is uncontested. Even though the Optane SSD's random read throughput is saturating at QD8, by QD6 it's outperforming what either flash SSD can deliver at any reasonable queue depth. Beyond QD8 the Optane SSD does not deliver even incremental improvement in throughput and increasing queue depth just adds latency. This test stops at QD64, which isn't enough to saturate either flash SSD. The Micron 9100 MAX is rated for a maximum of 750k random read IOPS, but clearly the Optane SSD delivers far better performance at the kinds of queue depths that are reasonably attainable.
All three SSDs show median latency growing slowly across a wide range of queue depths. At QD1 the 99th percentile curves are very close to the median latency curves, but at high queue depths the 99th percentile latency is around twice the median. For the Optane SSD and the Micron 9100 MAX, the 99.999th percentile latency is higher by another factor of two or so, but the Intel P3700 cannot deliver such tight regulation and its worst-case latencies are well over a millisecond.
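For readers unfamiliar with five-nines percentiles, the sketch below shows how the median, 99th, and 99.999th percentiles can be read off a histogram of whole-microsecond latency buckets using the nearest-rank method. The histogram is synthetic, made up to loosely resemble a low-latency drive at QD1, and is not our measured data; the nearest-rank method is also our choice for the example rather than a statement about how fio computes its percentiles.

```python
import math

# Synthetic histogram: {latency in µs: number of operations}. One million samples.
SYNTHETIC_HISTOGRAM = {8: 100_000, 9: 880_000, 10: 19_980, 30: 5, 37: 15}


def percentile_from_histogram(histogram, pct):
    """Nearest-rank percentile: the smallest latency such that at least
    pct percent of all samples are at or below it."""
    total = sum(histogram.values())
    # round() guards against floating-point noise before taking the ceiling
    rank = max(1, math.ceil(round(pct / 100 * total, 6)))
    seen = 0
    for latency in sorted(histogram):
        seen += histogram[latency]
        if seen >= rank:
            return latency
    raise ValueError("empty histogram")


for pct in (50, 99, 99.999):
    print(f"{pct}th percentile: {percentile_from_histogram(SYNTHETIC_HISTOGRAM, pct)} µs")
```

With one million samples, the 99.999th percentile is decided by only the ten slowest operations, which is why a handful of outliers can move it far away from the median.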
Random Write
Flash memory write operations are far slower than read operations. This is not always reflected in the performance specifications of SSDs because writes can be deferred and combined, allowing the SSD to signal completion before the data has actually moved from the drive's cache to the flash memory. The 3D XPoint memory used by the Optane SSD DC P4800X does have slower writes than reads, and it has been noted that Intel did not specify read latency when Optane was initially announced, but our results show that the disparity is not as large as it is with flash. With inherently fast writes and no page size and erase block limitations, the Optane SSD should be far less reliant on write combining and large spare areas to offer high throughput random writes. The drive's translation layer is probably far simpler than what flash SSDs require, potentially giving a latency advantage.
Queue Depth 1
As with random reads, we first examine QD1 random write performance of different transfer sizes. 4kB is usually the most important size, but some applications will make smaller writes when the drive has a 512B sector size. Larger transfer sizes make the workload somewhat less random, reducing the amount of bookkeeping the SSD controller needs to do and generally allowing for increased performance.
The Micron 9100 really doesn't like random writes smaller than 4kB, but both Intel drives handle it relatively well. The Optane SSD DC P4800X has only a 30% higher throughput result than the P3700 for transfer sizes of 4kB and smaller. The Intel P3700 (owing mainly to its relatively low capacity) doesn't benefit very much as transfer sizes grow beyond 4kB, as it saturates soon after. The Optane SSD maintains a clear lead for transfers of 8kB and larger, averaging about twice the throughput of the Micron 9100 as both show diminishing returns from increased transfer sizes.
Queue Depth >1
The test of 4kB random write throughput at different queue depths is structured identically to its counterpart random read test above. Queue depths from 1 to 64 are tested, with up to four threads used to generate this workload. Each tested queue depth is run for four minutes and the first minute is ignored when computing the statistics.
The QD1 starting points for all three drives are somewhat close together, with the fastest drive (the Optane SSD, of course) only offering about twice the random write throughput of the Micron 9100, with less than half the average latency. From there, the gaps widen quickly. The Intel P3700 reaches its maximum throughput very quickly and then the latency just piles up. The Micron 9100 keeps its median and 99th percentile latency reasonably well controlled until reaching its maximum throughput, which is half of what the Optane SSD can deliver.
While QD64 wasn't enough to completely saturate the flash SSDs with random reads, here with random writes, QD8 is enough for any of the drives, and the P3700 is done around QD2. The Micron 9100 starts out as the slowest of the three but soon overtakes the Intel P3700.
When examining the latency statistics, we should keep in mind that all three drives reached their full throughput by QD8. At queue depths higher than that, latency increases with no improvement to throughput. A well-tuned server will generally not be operating the drives in that regime, so the right half of these graphs can be mostly ignored.
Median latency for these drives is quite flat until they reach saturation. 99th percentile latency for the flash SSDs shoots up when they're operated at unnecessarily high queue depths. The 99.999th percentile latency of the Intel P3700 is never less than 1ms and actually exceeds 10ms at the end of the test. The Micron 9100's 99.999th percentile latency is fairly close to that of the Optane SSD until the 9100 hits QD4, where it spikes and surpasses 1ms shortly before the drive reaches full throughput. Meanwhile, the Optane SSD's 99.999th percentile latency only climbs up to a third of a millisecond even at QD64.
Sequential Read
Intel provides no specifications for sequential access performance of the Optane SSD DC P4800X. Buying an Optane SSD for a mostly sequential workload would make very little sense given that sufficiently large flash-based SSDs or RAID arrays can offer plenty of sequential throughput. Nonetheless, it will be interesting to see how much faster the Optane SSD is with sequential transfers instead of random access.
Sequential access is usually tested with 128kB transfers, but this is more of an industry convention and is not based on any workload trend as strong as the tendency for random I/Os to be 4kB. The point of picking a size like 128kB is to have transfers be large enough that they can be striped across multiple controller channels and still involve writing a full page or more to the flash on each channel. Real-world sequential transfer sizes vary widely depending on factors like which application is moving the data or how fragmented the filesystem is.
Even without a large native page size to its 3D XPoint memory, we expect the Optane SSD DC P4800X to exhibit good performance from larger transfers. A large transfer requires the controller to process fewer operations for the same amount of user data, and fewer operations means less protocol overhead on the wire. Based on the random access tests, it appears that the Optane SSD is internally managing the 3D XPoint memory in a way that greatly benefits from transfers being at least 4kB even though the drive emulates a 512B sector size out of the box.
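To make the overhead argument concrete, the short sketch below counts how many commands the controller must process to move the same amount of data at different transfer sizes. The 1GiB total and the helper name are arbitrary choices for illustration, and kB is taken to mean 1,024 bytes.

```python
def operations_needed(total_bytes, transfer_bytes):
    """Number of I/O commands required to move total_bytes (ceiling division)."""
    return -(-total_bytes // transfer_bytes)


if __name__ == "__main__":
    one_gib = 1 << 30
    for label, size in (("512B", 512), ("4kB", 4096), ("128kB", 128 * 1024)):
        print(f"{label:>6}: {operations_needed(one_gib, size):>9,} commands per GiB")
```

Moving 1GiB takes 2,097,152 commands at 512B, 262,144 at 4kB, and only 8,192 at 128kB, so any fixed per-command cost is amortized over 256 times more data at the largest size than at the smallest.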
The drives were preconditioned with two full writes using 4kB random writes, so the data on each drive is entirely fragmented. This may limit how much prefetching of user data the drives can perform on the sequential read tests, but they can likely benefit from better locality of access to their internal mapping tables.
Queue Depth 1
The test of sequential read performance at different transfer sizes was conducted at queue depth 1. Each transfer size was used for four minutes, and the throughput was averaged over the final three minutes of each test segment.
For transfer sizes up to 32kB, both Intel drives deliver similar sequential read speeds. Beyond 32kB the P3700 appears to be saturated but also highly inconsistent. The Micron 9100 is plodding along with very low but steadily growing speeds, and by the end of the test it has almost caught up with the Intel P3700. It was at least ten times slower than the Optane SSD until the transfer size reached 64kB. The Optane SSD passes 2GB/s with 128kB transfers and finishes the test at 2.3GB/s.
Queue Depth > 1
For testing sequential read speeds at different queue depths, we use the same overall test structure as for random reads: total queue depths of up to 64 are tested using a maximum of four threads. Each thread is reading sequentially but from a different region of the drive, so the read commands the drive receives are not entirely sorted by logical block address.
The Optane SSD DC P4800X starts out with a far higher QD1 sequential read speed than either flash SSD can deliver. The Optane SSD's median latency at QD1 is not significantly better than what the Intel P3700 delivers, but the P3700's 99th and 99.999th percentile latencies are at least an order of magnitude worse. Beyond QD1, the Optane SSD saturates while the Intel P3700 takes a temporary hit to throughput and a permanent hit to latency. The Micron 9100 starts out with low throughput and fairly high latency, but with increasing queue depth it manages to eventually surpass the Optane SSD's maximum throughput, albeit with ten times the latency.
The Intel Optane SSD DC P4800X starts this test at 1.8GB/s for QD1, and delivers 2.5GB/s at all higher queue depths. The Intel P3700 performs significantly worse when a second QD1 thread is introduced, but by the time there are four threads reading from the drive the total throughput has recovered. The Intel P3700 saturates a little past QD8, which is where the Micron 9100 passes it. The Micron 9100 then goes on to surpass the Optane SSD's throughput above QD16, but it too has saturated by QD64.
The Optane SSD's latency increases modestly from QD1 to QD2, and then unavoidably increases linearly with queue depth due to the drive being saturated and unable to offer any better throughput. The Micron 9100 starts out with almost ten times the average latency, but is able to hold that mostly constant as it picks up most of its throughput. Once the 9100 passes the Optane SSD in throughput it is delivering slightly better average latency, but substantially higher 99th and 99.999th percentile latencies. The Intel P3700's 99.999th percentile latency is the worst of the three across almost all queue depths, and its 99th percentile latency is only better than the Micron 9100's during the early portions of the test.
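The linear growth in latency for a saturated drive follows from Little's law: the average number of requests in flight equals throughput multiplied by average latency. The sketch below applies it to the Optane SSD's 2.5GB/s plateau with 128kB transfers. Treating 1GB as 10^9 bytes and 128kB as 131,072 bytes is an assumption, and the outputs are back-of-the-envelope estimates, not measured values.

```python
def saturated_mean_latency_us(queue_depth, throughput_bytes_per_s, transfer_bytes):
    """Little's law: mean latency = requests in flight / completion rate."""
    iops = throughput_bytes_per_s / transfer_bytes
    return queue_depth / iops * 1e6


if __name__ == "__main__":
    throughput = 2.5e9      # assumed: 2.5GB/s with GB = 10^9 bytes
    transfer = 128 * 1024   # assumed: 128kB = 131,072 bytes
    for qd in (2, 4, 8, 16, 32, 64):
        latency = saturated_mean_latency_us(qd, throughput, transfer)
        print(f"QD{qd:<3} ~{latency:8.1f} us")
```

Under these assumptions the estimate is roughly 105 microseconds at QD2 and about 3.4 milliseconds at QD64: once throughput is pinned, every doubling of queue depth doubles the average latency.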
Sequential Write
The sequential write tests are structured identically to the sequential read tests save for the direction the data is flowing. The sequential write performance of different transfer sizes is conducted with a single thread operating at queue depth 1. For testing a range of queue depths, a 128kB transfer size is used and up to four worker threads are used, each writing sequentially but to different portions of the drive. Each sub-test (transfer size or queue depth) is run for four minutes and the performance statistics ignore the first minute.
As with random writes, sequential write performance doesn't begin to take off until transfer sizes reach 4kB. Below that size, all three SSDs offer dramatically lower throughput, with the Optane SSD narrowly ahead of the Intel P3700. The Optane SSD shows the steepest growth as transfer size increases, but it and the Intel P3700 begin to show diminishing returns beyond 64kB. The Optane SSD almost reaches 2GB/s by the end of the test while the Intel P3700 and the Micron 9100 reach around 1.2-1.3GB/s.
Queue Depth > 1
When testing sequential writes at varying queue depths, the Intel SSD DC P3700's performance was highly erratic. We did not have sufficient time to determine what was going wrong, so its results have been excluded from the graphs and analysis below.
The Optane SSD DC P4800X delivers better sequential write throughput at every queue depth than the Micron 9100 can deliver at any queue depth. The Optane SSD's latency increases only slightly as it reaches saturation while the Micron 9100's 99th percentile latency begins to climb steeply well before that drive reaches its maximum throughput. The Micron 9100's 99.999th percentile latency also grows substantially as throughput increases, but its growth is more evenly spread across the range of queue depths.
The Optane SSD reaches its maximum throughput at QD2 and maintains it as more threads and higher queue depths are introduced. The Micron 9100 only provides a little over half of the throughput and requires a queue depth of around 6-8 to reach that performance.
The Micron 9100's 99th percentile latency starts out around twice that of the Optane SSD, but at QD3 it increases sharply as the drive approaches its maximum throughput until it is an order of magnitude higher than the Optane SSD. The 99.999th percentile latencies of the two drives are separated by a wide margin throughout the test.
Mixed Read/Write Performance
Workloads consisting of a mix of reads and writes can be particularly challenging for flash based SSDs. When a write operation interrupts a string of reads, it will block access to at least one flash chip for a period of time that is substantially longer than a read operation takes. This hurts the latency of any read operations that were waiting on that chip, and with enough write operations throughput can be severely impacted. If the write command triggers an erase operation on one or more flash chips, the traffic jam is many times worse.
The occasional read interrupting a string of write commands doesn't necessarily cause much of a backlog, because writes are usually buffered by the controller anyways. But depending on how much unwritten data the controller is willing to buffer and for how long, a burst of reads could force the drive to begin flushing outstanding writes before they've all been coalesced into optimal sized writes.
The effect of a read still applies to the Optane SSD's 3D XPoint memory, but with greatly reduced severity. Whether a block of reads coming in has an effect depends on how the Optane SSD's controller manages the 3D XPoint memory.
Queue Depth 4
Our first mixed workload test is an extension of what Intel describes in their specifications for throughput of mixed workloads. A total queue depth of 16 is achieved using four worker threads at QD4 each, with every thread performing a mix of random reads and random writes. Instead of just testing a 70% read mixture, the full range from pure reads to pure writes is tested at 10% increments.
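The sweep described above amounts to eleven test points. A minimal, tool-agnostic sketch of that schedule follows; the function and field names are hypothetical, and the equal split of the total queue depth across threads is an assumption.

```python
def mixed_workload_sweep(step_pct=10, threads=4, total_qd=16):
    """List the read/write mixes from pure reads to pure writes."""
    per_thread_qd = total_qd // threads  # assumed: depth divided evenly
    return [
        {
            "read_pct": read_pct,
            "write_pct": 100 - read_pct,
            "threads": threads,
            "qd_per_thread": per_thread_qd,
        }
        for read_pct in range(100, -1, -step_pct)
    ]


if __name__ == "__main__":
    for point in mixed_workload_sweep():
        print(point)
```

The 70% read point that Intel specifies is just one of these eleven entries, which is why the full sweep gives a better picture of where each drive's throughput bottoms out.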
The Optane SSD's throughput does indeed show the bathtub curve shape that is common for this sort of mixed workload test, but the sides are quite shallow and the minimum (at 40% reads/60% writes) is still 83% of the peak throughput (which occurs with the all-reads workload). While the Optane SSD is operating near 2GB/s the flash SSDs spend most of the test only slightly above 500MB/s. When the portion of writes increases to 70%, the two flash SSDs begin to diverge: the Intel P3700 loses almost half its throughput and only recovers a little of it during the remainder of the test, while the Micron 9100 begins to accelerate and comes much closer to the Optane SSD's level of performance.
The median latency curves for the two flash SSDs show a substantial drop when the median operation switches from a read to a cacheable write. The P3700's median latency even briefly drops below that of the Optane SSD, but then the Optane SSD is handling several times the throughput. The 99th and 99.999th percentile latencies for the Optane SSD are relatively flat after jumping a bit when writes are first introduced to the mix. The flash SSDs have far higher 99th and 99.999th percentile latencies through the middle of the test, but far fewer outliers during the pure read and pure write phases.
Adding Writes to a Drive that is Reading
The next mixed workload test takes a different approach and is loosely based on the Aerospike Certification Tool. The read workload is constant throughout the test: a single thread performing 4kB random reads at QD1. Threads performing 4kB random writes at QD1 and throttled to 100MB/s are added to the mix until the drive's throughput is saturated. As the write workload gets heavier, the random read throughput will drop and the read latency will increase.
The three SSDs have very different capacity for random write throughput: the Intel P3700 tops out around 400MB/s, the Micron 9100 can sustain 1GB/s, and the Intel Optane SSD DC P4800X can sustain almost 2GB/s. The Optane SSD's average read latency increases by a factor of 5, but that is still enough to provide about 25k read IOPS. The flash SSDs both experience read latency growing by an order of magnitude as write throughput approaches saturation. Even though the Intel P3700 has a much lower capacity for random writes, it provides slightly lower random read latency at its saturation point than the Micron 9100. When comparing the two flash SSDs with the same write load, the Micron 9100 provides far more random read throughput.
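Two pieces of arithmetic sit behind this test, sketched below with hypothetical helper names. The first lists the write load after each 100MB/s writer is added, using the approximate saturation figures quoted above rounded to whole hundreds of MB/s (the Optane SSD's "almost 2GB/s" is rounded up to 2,000MB/s, which is an assumption). The second converts QD1 read IOPS to average latency: with a single request in flight, the two are simply reciprocals.

```python
def write_load_steps(saturation_mb_s, per_thread_mb_s=100):
    """Write throughput after each throttled writer is added, up to saturation."""
    threads = saturation_mb_s // per_thread_mb_s
    return [n * per_thread_mb_s for n in range(threads + 1)]


def qd1_latency_us(iops):
    """At queue depth 1 only one request is in flight, so mean latency = 1 / IOPS."""
    return 1e6 / iops


if __name__ == "__main__":
    for name, limit in (("P3700", 400), ("Micron 9100", 1000), ("Optane P4800X", 2000)):
        steps = write_load_steps(limit)
        print(f"{name}: {len(steps) - 1} writer threads, up to {steps[-1]}MB/s")
    print(f"25k read IOPS at QD1 ~ {qd1_latency_us(25_000):.0f} us per read")
```

By this estimate, about 25k read IOPS from a single QD1 thread corresponds to an average read latency of roughly 40 microseconds, measured while the drive is also absorbing its full random write load.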
Final Words: Is 3D XPoint Ready?
The Intel Optane SSD DC P4800X is a very high-performing enterprise SSD, but more importantly it is the first shipping product using Intel's 3D XPoint memory technology. After a year and a half of talking up 3D XPoint, Intel has finally shipped something. The P4800X proves that 3D XPoint memory is real and that it really works. The P4800X is just a first-generation product, but it's more than sufficient to establish 3D XPoint memory as a serious contender in the storage market.
If your workload matches its strengths, the P4800X offers performance that cannot currently be provided by any other storage product. This means high throughput random access, as well as very strict latency requirements - the results Optane achieves for its quality of service for latency on both reads and writes, especially in heavy environments with a mixed read/write workload, are a significant margin ahead of anything available on the market.
At 50/50 reads/writes, latency QoS for the DC P4800X is 30x better than the competition
The Intel Optane SSD DC P4800X is not the fastest SSD ever on every single test. It's based on a revolutionary technology, but no matter how high expectations were, very rarely does a first-generation product take over the world unless it becomes ubiquitous and cheap on day one. The Optane SSD is ultimately an expensive niche product. If you don't need high throughput random access with the strictest latency requirements, the Optane SSD DC P4800X may not be the best choice. It is very expensive compared to most flash-based SSDs.
With the Optane SSD and 3D XPoint memory now clearly established as useful and usable, the big question is how broad its appeal will be. The original announcements around Optane promised a lot, and this initial product delivers a few of those metrics, so to some extent, the P4800X may have to grow its own market and reteach partners what Optane is capable of today. Working with developers and partners is going to be key here - Intel will have to perform outreach and entice software developers to write applications that rely on extremely fast storage. That being said, there are plenty of market segments already that can never get enough storage performance, so anything above what is available in the market today will be more than welcome.
There's still much more we would like to know about the Optane SSD and the 3D XPoint memory it contains. Since our testing was remote, we have not yet even had the chance to look under the drive's heatsink, or measure the power efficiency of the Optane SSD and compare it against other SSDs. We are awaiting an opportunity to get a drive in hand, and expect some of the secrets under the hood to be exposed in due course as drives filter through the ecosystem.