Original Link: https://www.anandtech.com/show/9112/exploring-dx12-3dmark-api-overhead-feature-test
Exploring DirectX 12: 3DMark API Overhead Feature Test
by Ryan Smith & Ian Cutress on March 27, 2015 8:00 AM EST
To say there’s a bit of excitement for DirectX 12 and other low-level APIs is probably an understatement. A big understatement. With DirectX 12 ramping up for a release later this year, Mantle 1.0 already in pseudo-release, and its successor Vulkan under active development, the world of graphics APIs is changing in a way not seen since the earliest days, when APIs such as Direct3D, OpenGL, and numerous vendor proprietary APIs were first released. From a consumer standpoint this change will still take a number of years, but from a development standpoint 2015 is going to be the year that everything changed for PC graphics programming.
So far much has been made about the benefits of these APIs, the potential performance improvements, and ultimately what new things can be achieved with them. The true answer to those questions is that this is going to be a multi-generational effort; until games are built from the ground-up for these APIs, developers won’t be able to make full use of their capabilities. Even then, the coolest tricks will take some number of years to develop, as developers become better acquainted with these new APIs, their idiosyncrasies, and the capabilities of the underlying hardware when interfaced with these APIs. In other words, right now we’re just scratching the surface.
The first DirectX 12 games are expected towards the end of the year, and in the meantime Microsoft and their hardware partners have been ramping up the DirectX 12 ecosystem, hammering out the API implementation in Windows 10 while the hardware vendors write and debug their WDDM 2.0 drivers. Meanwhile, as this has been going on, we’ve seen a slow trickle of software designed to showcase DirectX 12 features in a proof of concept manner. A number of internal demos exist, and we saw the first semi-public DirectX 12 software release last month with our look at Star Swarm.
This week the benchmarking gurus over at Futuremark are releasing their own first run at a DirectX 12 test with their latest update for the 3DMark benchmark. Futuremark has been working away at DirectX 12 for some time – in fact they were the first partner to show DirectX 12 code in action at Microsoft’s 2014 DX12 unveiling – and now they are releasing their first DirectX 12 project.
In keeping with the general theme of the demos we’ve seen so far, Futuremark’s new DirectX 12 release is another proof of concept test. Dubbed the 3DMark API Overhead Feature Test, this benchmark is a purely synthetic benchmark designed to showcase the draw call benefits of the new API even more strongly than earlier benchmarks. Whereas Star Swarm was a best-case scenario test within the confines of a realistic graphics workload, the API Overhead Feature Test is a proper synthetic benchmark that is designed to test one thing and one thing only: how many draw calls a system can handle. The end result, as we’ll see, showcases just how great the benefits of DirectX 12 are in this situation, allowing for an order of magnitude’s improvement, if not more.
To do this, Futuremark has written a relatively simple test that draws out a very simple scene with an ever-increasing number of objects in order to measure how many draw calls a system can handle before it becomes saturated. As expected for a synthetic test, the underlying rendering task is very simple – render an immense number of building-like objects at both the top and bottom of the screen – and the bottleneck is in processing the draw calls. Generally speaking, under this test you should either be limited by the number of draw calls you can generate (CPU limited) or limited by the number of draw calls you can consume (GPU’s command processor limited), and not the GPU’s actual rendering capabilities. The end result is that the API Overhead Feature Test can push an even larger number of draw calls than Star Swarm could.
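As a rough mental model of the two limits described above, sustained throughput is whichever is lower: the rate at which the CPU can generate draw calls or the rate at which the GPU's command processor can consume them. The sketch below is only an illustration of that idea; the function name and the example rates are hypothetical and are not measurements from this article.

```python
def sustained_draw_calls(cpu_submit_rate, gpu_consume_rate):
    """Return (draw calls per second, limiting side) for a simple two-stage model.

    Both rates are in draw calls per second. This ignores everything else a
    real frame does; it only captures the "slowest stage wins" idea.
    """
    rate = min(cpu_submit_rate, gpu_consume_rate)
    if cpu_submit_rate < gpu_consume_rate:
        limiter = "CPU"
    else:
        limiter = "GPU command processor"
    return rate, limiter


# Hypothetical numbers, chosen only to show both outcomes.
print(sustained_draw_calls(2_000_000, 15_000_000))   # CPU-limited case
print(sustained_draw_calls(25_000_000, 15_000_000))  # command-processor-limited case
```

In this model, lowering API overhead raises only the CPU-side rate, so the gains stop as soon as the command processor becomes the slower stage.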
To showcase the difference between various APIs, this test is available with DirectX 12 and Mantle, but also two different DirectX 11 modes. Standard DirectX 11 single-threading is one mode, alongside support for DirectX 11 multi-threading. The latter has a checkered history – it never did work as well in the real world as initially hoped – and in practice only NVIDIA supports it to any decent degree. But regardless, as we’ll see DirectX 12’s throughput will put even DX11MT to shame.
Futuremark’s complete technical description is posted below:
The test is designed to make API overhead the performance bottleneck. The test scene contains a large number of geometries. Each geometry is a unique, procedurally-generated, indexed mesh containing 112-127 triangles.
The geometries are drawn with a simple shader, without post processing. The draw call count is increased further by drawing a mirror image of the geometry to the sky and using a shadow map for directional light.
The scene is drawn to an internal render target before being scaled to the back buffer. There is no frustum or occlusion culling to ensure that the API draw call overhead is always greater than the application side overhead generated by the rendering engine.
Starting from a small number of draw calls per frame, the test increases the number of draw calls in steps every 20 frames, following the figures in the table below.
To reduce memory usage and loading time, the test is divided into two parts. The second part starts at 98304 draw calls per frame and runs only if the first part is completed at more than 30 frames per second.
Draw calls per frame | Draw calls per frame increment per step | Accumulated duration in frames |
192 – 384 | 12 | 320 |
384 – 768 | 24 | 640 |
768 – 1536 | 48 | 960 |
1536 – 3072 | 96 | 1280 |
3072 – 6144 | 192 | 1600 |
6144 – 12288 | 384 | 1920 |
12288 – 24576 | 768 | 2240 |
24576 – 49152 | 1536 | 2560 |
49152 – 98304 | 3072 | 2880 |
98304 – 196608 | 6144 | 3200 |
196608 – 393216 | 12288 | 3520 |
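The figures in Futuremark's table follow a regular pattern, which can be reproduced with a short sketch. It assumes, as the table implies but the description does not state outright, that each range doubles the draw call count over 16 steps (the increment is one sixteenth of the range's starting value) and that each step lasts 20 frames. The function and parameter names are illustrative only.

```python
def draw_call_schedule(start=192, ranges=11, steps_per_range=16, frames_per_step=20):
    """Yield (range_start, range_end, increment, accumulated_frames) per range.

    steps_per_range=16 is inferred from the published table, not quoted from it.
    """
    frames = 0
    lo = start
    for _ in range(ranges):
        hi = lo * 2                              # each range ends at double its start
        increment = (hi - lo) // steps_per_range
        frames += steps_per_range * frames_per_step
        yield lo, hi, increment, frames
        lo = hi


rows = list(draw_call_schedule())
for lo, hi, inc, frames in rows:
    print(f"{lo:>6} - {hi:<6}  +{inc:<5} per step  {frames} frames")
```

Under these assumptions the second part of the test, which begins at 98304 draw calls per frame, corresponds to the last two ranges.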
Other Notes
Before jumping into our results, let’s quickly talk about testing.
For our test we are using the latest version of the Windows 10 technical preview – build 10041 – and the latest drivers from AMD, Intel, and NVIDIA. In fact for testing DirectX 12 these latest packages are the minimum versions that the test supports. Meanwhile 3DMark does of course also run on Windows Vista and later, however on Windows Vista/7/8 only the DirectX 11 and Mantle tests are available since those are the only APIs available.
From a test reliability standpoint the API Overhead Feature Test (or as we’ll call it from now, AOFT) is generally reliable under DirectX 12 and Mantle, however we would like to note that we have found it to be somewhat unreliable under DirectX 11. DirectX 11 scores have varied widely at times, and we’ve seen one configuration flip between 1.4 million draw calls per second and 1.9 million draw calls per second based on indeterminable factors.
Our best guess right now is that the variability comes from the much greater overhead of DirectX 11, and consequently all of the work that the API, video drivers, and OS are undertaking in the background. The DirectX 11 results are good enough for what the AOFT has set out to do – showcase just how much faster DX12 and Mantle are – but they have a much higher degree of variability than our standard tests and should be treated accordingly.
Meanwhile Futuremark for their part is looking to make it clear that this is first and foremost a test to showcase API differences, and is not a hardware test designed to showcase how different components perform.
The purpose of the test is to compare API performance on a single system. It should not be used to compare component performance across different systems. Specifically, this test should not be used to compare graphics cards, since the benefit of reducing API overhead is greatest in situations where the CPU is the limiting factor.
We have of course gone and benchmarked a number of configurations to showcase how they benefit from DirectX 12 and/or Mantle, however as per Futuremark’s guidelines we are not looking to directly compare video cards. Especially since we’re often hitting the throughput limits of the command processor, something a real-world task would not suffer from.
The Test
Moving on, we also want to quickly point out the clearly beta state of the current WDDM 2.0 drivers. Of note, the DX11 results with NVIDIA’s 349.90 driver are notably lower than the results with their WDDM 1.3 driver, showing much greater variability. Meanwhile AMD’s drivers have stability issues, with our dGPU testbed locking up a couple of different times. So these drivers are clearly not at production status.
DirectX 12 Support Status |
GPU Architecture | Current Status | Supported At Launch |
AMD GCN 1.2 (285) | Working | Yes |
AMD GCN 1.1 (290/260 Series) | Working | Yes |
AMD GCN 1.0 (7000/200 Series) | Working | Yes |
NVIDIA Maxwell 2 (900 Series) | Working | Yes |
NVIDIA Maxwell 1 (750 Series) | Working | Yes |
NVIDIA Kepler (600/700 Series) | Working | Yes |
NVIDIA Fermi (400/500 Series) | Not Active | Yes |
Intel Gen 7.5 (Haswell) | Working | Yes |
Intel Gen 8 (Broadwell) | Working | Yes |
On that note, the OS and drivers are all still in development, so performance results are subject to change as Windows 10 and the WDDM 2.0 drivers get closer to finalization.
One bit of good news is that DirectX 12 support on AMD GCN 1.0 cards is up and running here, as opposed to the issues we ran into last month with Star Swarm. So other than NVIDIA’s Fermi cards, which aren’t turned on in beta drivers, we have the ability to test all of the major x86-paired GPU architectures that support DirectX 12.
For our actual testing, we’ve broken down our testing for dGPUs and for iGPUs. Given the vast performance difference between the two and the fact that the CPU and GPU are bound together in the latter, this helps to better control for relative performance.
On the dGPU side we are largely reusing our Star Swarm test configuration, meaning we’re testing the full range of working DX12-capable GPU architectures across a range of CPU configurations.
DirectX 12 Preview dGPU Testing CPU Configurations (i7-4960X) |
Configuration | Emulating |
6C/12T @ 4.2GHz | Overclocked Core i7 |
4C/4T @ 3.8GHz | Core i5-4670K |
2C/4T @ 3.8GHz | Core i3-4370 |
Meanwhile on the iGPU side we have a range of Haswell and Kaveri processors from Intel and AMD respectively.
CPU: | Intel Core i7-4960X @ 4.2GHz |
Motherboard: | ASRock Fatal1ty X79 Professional |
Power Supply: | Corsair AX1200i |
Hard Disk: | Samsung SSD 840 EVO (750GB) |
Memory: | G.Skill RipjawZ DDR3-1866 4 x 8GB (9-10-9-26) |
Case: | NZXT Phantom 630 Windowed Edition |
Monitor: | Asus PQ321 |
Video Cards: | AMD Radeon R9 290X, AMD Radeon R9 285, AMD Radeon HD 7970, NVIDIA GeForce GTX 980, NVIDIA GeForce GTX 750 Ti, NVIDIA GeForce GTX 680 |
Video Drivers: | NVIDIA Release 349.90 Beta, AMD Catalyst 15.200.1012.2 Beta |
OS: | Windows 10 Technical Preview (Build 10041) |
CPU: | AMD A10-7850K, AMD A10-7700K, AMD A8-7600, AMD A6-7400L, Intel Core i7-4790K, Intel Core i5-4690, Intel Core i3-4360, Intel Core i3-4130T, Pentium G3258 |
Motherboard: | GIGABYTE F2A88X-UP4 for AMD, ASUS Maximus VII Impact for Intel LGA-1150, Zotac ZBOX EI750 Plus for Intel BGA |
Power Supply: | Rosewill Silent Night 500W Platinum |
Hard Disk: | OCZ Vertex 3 256GB OS SSD |
Memory: | G.Skill 2x4GB DDR3-2133 9-11-10 for AMD, G.Skill 2x4GB DDR3-1866 9-10-9 at 1600 for Intel |
Video Cards: | AMD APU Integrated, Intel CPU Integrated |
Video Drivers: | AMD Catalyst 15.200.1012.2 Beta, Intel Driver Version 10.18.15.4124 |
OS: | Windows 10 Technical Preview (Build 10041) |
Discrete GPU Testing
We’ll kick things off with our discrete GPUs, which should present us with a best case scenario for DirectX 12 from a hardware standpoint. With the most powerful CPUs feeding the most powerful GPUs, the system can both generate a massive number of draw calls and consume them in equally large numbers, so this is where DirectX 12 will be at its best.
We’ll start with a look at CPU scaling on our discrete GPUs. How much benefit do we see going from 2 to 4 and finally 6 CPU cores?
The answer on the CPU side is quite a lot. Whereas Star Swarm generally topped out at 4 cores – after which it was often GPU limited – we see gains all the way up to 6 cores on our most powerful cards. This is a simple but important reminder of the fact that the AOFT is a synthetic test designed specifically to push draw calls and avoid all other bottlenecks as much as possible, leading to increased CPU scalability.
With that said, it’s clear that we’re reaching the limits of our GPUs with 6 cores. While the gains from 2 to 4 cores are rather significant, the gain from 4 to 6 cores (which also comes with a slight bump in clockspeed) is much more muted, even with our most powerful cards. Meanwhile anything slower than a Radeon R9 285 is showing no real scaling from 4 to 6 cores, indicating a rough cutoff right now of how powerful a card needs to be to take advantage of more than 4 cores.
Moving on, let’s take a look at the actual API performance scaling characteristics at 6, 4, and 2 cores.
6 cores of course is a best case scenario for DirectX 12 – it’s the least likely to be CPU-bound – and we see first-hand the incredible increase in draw call throughput by switching from DirectX 11 to DirectX 12 or Mantle.
Somewhat unexpectedly, the greatest gains and the highest absolute performance are achieved by AMD’s Radeon R9 290X. As we saw in Star Swarm and continue to see here, AMD’s DirectX 11 throughput is relatively poor, topping out at 1.1 million draw calls per second for both DX11ST and DX11MT. AMD simply isn’t able to push much more than that many calls through their drivers, and without real support for DX11 multi-threading (e.g. DX11 Driver Command Lists), they gain nothing from the DX11MT test.
But on the opposite side of the coin, this means they have the most to gain from DirectX 12. The R9 290X sees a 16.8x increase in draw call throughput switching from DX11 to DX12. At 18.5 million draw calls per second this is the highest draw call rate out of any of our cards, and we have good reason to suspect that we’re GPU command processor limited at this point. Which is to say that our CPU could push yet more draw calls if only a GPU existed that could consume that many calls. On a side note, 18.5M calls per second would break down to just over 300K calls per frame at 60fps, which is a similarly insane number compared to today’s standards, where most games rarely exceed 10K draw calls per frame.
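The per-frame figure is simple arithmetic and easy to check. The sketch below uses only numbers quoted in the text (18.5M draw calls per second, a 16.8x gain, 60fps); the helper name is illustrative.

```python
def calls_per_frame(calls_per_second, fps=60):
    """Convert a draw call rate into a per-frame budget at a target frame rate."""
    return calls_per_second / fps


dx12_rate = 18.5e6   # R9 290X under DirectX 12, draw calls per second
dx12_gain = 16.8     # quoted increase over DirectX 11

print(round(calls_per_frame(dx12_rate)))        # per-frame budget at 60fps
print(round(dx12_rate / dx12_gain / 1e6, 2))    # implied DX11 rate, in millions
```

The first value comes out a little above 308K calls per frame, and the second lands at roughly 1.1 million draw calls per second, which agrees with the DirectX 11 figure given for the AMD cards.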
Meanwhile we see a reduction in gains going from the 290X to the 285 and finally to the 7970. As we mentioned earlier we appear to be command processor limited, and each one of these progressively weaker GPUs appears to contain a similarly weaker command processor. Still, even the “lowly” 7970 can push 11.6M draw calls per second, which is a 10.5x (order of magnitude) increase in draw call performance over DirectX 11.
Mantle on the other hand presents an interesting aside. As AMD’s in-house API (and forerunner to Vulkan), the AMD cards do even better on Mantle than they do DirectX 12. At this point the difference is somewhat academic – what are you going to do with 20.3M draw calls over 18.5M – but it goes to show that Mantle can still squeeze out a bit more at times. It will be interesting to see whether this holds as Windows 10 and the drivers are finalized, and even longer term whether these benefits are retained by Vulkan.
As for the NVIDIA cards, NVIDIA sees neither quite the awesome relative performance gains from DirectX 12 nor enough absolute performance to top the charts, but here too we see the benefits of DirectX 12 in full force. At 1.9M draw calls per second in DX11ST and 2.2M draw calls per second in DX11MT, NVIDIA starts out in a much better position than AMD does; in the latter they essentially can double AMD’s DX11MT throughput (or alternatively have half the API overhead).
Once DX12 comes into play though, NVIDIA’s throughput rockets through the roof as well. The GTX 980 sees an 8.2x increase over DX11ST, and a 7x increase over DX11MT. On an absolute basis the GTX 980 is consuming 15.5M draw calls per second (or about 250K per frame at 60fps), showing that even the best DX11 implementation can’t hold a candle to this early DirectX 12 implementation. The benefits of DirectX 12 really are that great for draw call performance.
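The GTX 980 ratios can be reproduced from the rates quoted above. This is just a sketch of the arithmetic with an illustrative helper name; the rates are the rounded figures from the text, so the results are approximate.

```python
def speedup(new_rate, old_rate):
    """Ratio of two draw call rates; values above 1 mean new_rate is faster."""
    return new_rate / old_rate


# GTX 980 draw calls per second, as quoted in the article.
rates = {"DX11ST": 1.9e6, "DX11MT": 2.2e6, "DX12": 15.5e6}

for api in ("DX11ST", "DX11MT"):
    print(f"DX12 vs {api}: {speedup(rates['DX12'], rates[api]):.1f}x")
```

Because the inputs are rounded to one decimal place in millions, the computed ratios match the quoted 8.2x and 7x only to about that precision.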
Like AMD, NVIDIA seems to be command processor limited here. GPU-Z reports 100% GPU usage in the DX12 test, indicating that by NVIDIA’s internal metrics the card is working as hard as it can. Meanwhile, though not charted, I also tested a GTX Titan X here, which achieved virtually the exact same results as the GTX 980. In the absence of more evidence to support being CPU bound, I have to assume that the GM200 GPU uses a similar command processor to the GM204 based GTX 980, leading to a similar bottleneck. Which would make some sense, as the GM200 is by all practical measurements a supersized version of GM204.
Moving down the NVIDIA lineup, we see performance decrease as we work towards the GTX 680 and GTX 750 Ti. The latter is a newer product, based on the GM107 GPU, but ultimately it is a smaller and lower performing GPU than the GTX 680. Regardless, we are hitting the lower command processor throughput limits of these cards, and seeing the maximum DX12 throughput decrease accordingly. This means that the relative gains are smaller – DX11 performance is virtually the same as GTX 980 since the CPU is the limit there – but even GTX 750 Ti sees a 3.8x increase in throughput over DX11ST.
Finally, it’s here where we’re seeing a distinct case of the DX11 test producing variable results. For the NVIDIA cards we have seen our results fluctuate between 1.4M and 1.9M. Of all of our runs 1.9M is more common – not to mention it’s close to the score we get on NVIDIA’s public WDDM 1.3 drivers – so it’s what we’re publishing here. However for whatever reason, 1.4M will become more common with fewer cores even though the bottleneck was (and remains) single-core performance.
As for performance scaling with 4 cores, it’s very similar to what we saw with 6 cores. As we noted in our CPU-centric look at our data, only the fastest cards benefit from 6 cores, so the performance we see with 4 cores is quite similar to what we saw before. AMD of course still sees the greatest gains, while overall the gap between AMD and NVIDIA is compressed some.
Interestingly Mantle’s performance advantage melts away here. DirectX 12 is now the fastest API for all AMD cards, indicating that DX12 scales out better to 4 cores than Mantle, but perhaps not as well to 6 cores.
Finally with 2 cores many of our configurations are CPU limited. The baseline changes a bit – DX11MT ceases to be effective since 1 core must be reserved for the display driver – and the fastest cards have lost quite a bit of performance here. Nonetheless, the AMD cards can still hit 10M+ draw calls per second with just 2 cores, and the GTX 980/680 are close behind at 9.4M draw calls per second. That is again a minimum 6.7x increase in draw call throughput versus DirectX 11, showing that even on relatively low performance CPUs the draw call gains from DirectX 12 are substantial.
Overall then, with 6 CPU cores in play AMD appears to have an edge in command processor performance, allowing them to sustain a higher draw call throughput than NVIDIA. That said, as we know the real world performance of the GTX 980 easily surpasses the R9 290X, which is why it’s important to remember that this is a synthetic benchmark. Meanwhile at 2 cores where we become distinctly CPU limited, AMD appears to still have an edge in DirectX 12 throughput, an interesting role reversal from their poorer DirectX 11 performance.
Integrated GPU Testing
Switching gears from high performance discrete GPUs, we have our integrated GPUs. From a high level overview the gains from DirectX 12 are not going to be quite as large here as they are with dGPUs due to the much lower GPU performance, but there is still ample opportunity to benefit from increased draw call performance.
Here we have Intel’s Haswell CPUs, and AMD’s Kaveri APUs. We'll start off with the higher-end processors, the Intel Core i3/i5/i7 and AMD A10/A8.
As expected, at the high-end the performance gains from DirectX 12 are not quite as great as they were with the dGPUs, but we’re still seeing significant gains. The largest gains of course are found with the AMD processors, thanks to their much stronger iGPUs. From DX11ST to DX12 we’re seeing a surprisingly large 6.8x increase in draw call performance, from 655K to 4,470K.
As to be expected, with a relatively weak CPU, AMD’s DX11 draw call performance isn’t very strong here relative to their strong GPU and of course our more powerful dGPUs. Still, it ends up being better than Intel (who otherwise has the stronger CPU), so we see AMD offering better draw call throughput at all levels. Ultimately what this amounts to is that AMD has quite a bit more potential under DX12.
Mantle meanwhile delivers a very slight edge over DX12 here, although for all practical purposes the two should be considered tied.
Meanwhile for the Intel CPUs, the gains from DX12 aren’t quite as large as with the AMD processors, but they’re still significant, and this is why Intel is happily backing DX12. All 3 processors share the same GT2 GPU and see similar gains. Starting from a baseline of 625K draw calls under DX11 – almost identical to AMD – the i7-4790K jumps up by 3.2x to 2,033K draw calls under DX12. The i5 and the i3 processors see 1,977K and 1,874K respectively, and after adjusting for clockspeeds it’s clear that we’re GPU command processor limited at all times here, hence why even a 2 core i3 can deliver similar gains.
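To see how tightly the three Intel results cluster, the DX12 figures above can be expressed relative to the i7. The sketch uses only the numbers quoted in the text; applying the 625K DX11 baseline to the i7 alone is an assumption based on how the sentence is worded.

```python
# DX12 draw calls per second (in thousands) as quoted for the Haswell GT2 parts.
dx12_kcalls = {"i7": 2033, "i5": 1977, "i3": 1874}
dx11_baseline_kcalls = 625  # assumed here to be the i7's DX11 result

# Gain for the i7 over its DX11 baseline.
i7_gain = dx12_kcalls["i7"] / dx11_baseline_kcalls

# Each processor's DX12 throughput as a fraction of the i7's.
relative = {cpu: k / dx12_kcalls["i7"] for cpu, k in dx12_kcalls.items()}

print(f"i7 gain over DX11: {i7_gain:.2f}x")
for cpu, frac in relative.items():
    print(f"{cpu}: {frac:.1%} of i7 throughput")
```

The i3 comes in at a little over 92% of the i7's throughput despite having half the cores, which is the pattern one would expect if the shared GT2 GPU, rather than the CPU, is the limit.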
Intel does end up seeing the smallest gains here, but again even in this sort of worst case scenario of a powerful CPU paired with a weak GPU, DX12 still improved draw call performance by over 3.2x. This means that in the long run even games that are targeting lower-performance PCs still stand to see a major increase in the number of draw calls they can use thanks to DirectX 12.
The story is much the same with our lower performance processors. AMD continues to see the largest gains and largest absolute performance under DirectX 12. With a 7x performance increase for the A8, even this weaker processor benefits greatly from the use of a low-level API.
The Intel processors see smaller gains as well, but they too are similarly significant. Even the Pentium, with its basic GT1 GPU and pair of relatively low clocked CPU cores, sees a 2.7x increase in draw call performance from DirectX 12.
Closing Thoughts
Wrapping things up, Futuremark’s latest benchmark certainly gives us a new view on DirectX 12, and of course another data point in looking at the performance of the forthcoming API.
Since being announced last year – and really, since Mantle was announced in 2013 – the initial focus on low-level APIs has been on draw call throughput, and for good reason. The current high-level API paradigm has significant CPU overhead and at the same time fails to scale well with multiple CPU cores, leading to a sort of worst-case scenario for trying to push draw calls. At the same time console developers have long enjoyed lower-level access and the accompanying improvement in draw calls, an advantage that has become an issue for the PC in the age of so many multiplatform titles.
DirectX 12 then will be a radical overhaul to how GPU programming works, but at its most basic level it’s a fix for the draw call problem. And as we’ve seen in Star Swarm and now the 3DMark API Overhead Feature Test, the results are nothing short of dramatic. With the low-level API offering a 10x-20x increase in draw call throughput, any sort of draw call problems the PC was facing with high-level APIs are thoroughly put to rest by the new API. With the ability to push upwards of 20 million draw calls per second, PC developers should finally be able to break away from doing tricks to minimize draw calls in the name of performance and focus on other aspects of game design.
GDC 2014 - DirectX 12 Unveiled: 3DMark 2011 CPU Time: Direct3D 11 vs. Direct3D 12
Of course at the same time we need to be clear that 3DMark’s API Overhead Feature Test is a synthetic test – and is so by design – so the performance we’re looking at today is just one small slice of the overall performance picture. Real world game performance gains will undoubtedly be much smaller, especially if games aren’t using a large number of draw calls in the first place. But the important part is that it sets the stage for future games to use a much larger number of draw calls and/or spend less time trying to minimize the number of calls. And of course we can’t ignore the multi-threading benefits from DirectX 12, as while multi-threaded games are relatively old now, the inability to scale up throughput with additional cores has always been an issue that DirectX 12 will help to solve.
Ultimately we’re looking at just one test, and a synthetic test at that, but as gamers if we want to better understand why game developers such as Johan Andersson have been pushing so hard for low-level APIs, the results of this benchmark are exactly why. From discrete to integrated, top to bottom, every performance level of PC stands to gain from DirectX 12, and for virtually all of them the draw call gains are going to be immense. DirectX 12 won’t change the world, but it will change the face of game programming for the better, and it will be very interesting to see just what developers can do with the API starting later this year.