Name: 64 Cores of Rendering Madness: The AMD Threadripper Pro 3995WX Review
Author: Dr. Ian Cutress

Original Link: https://www.anandtech.com/show/16478/64-cores-of-rendering-madness-the-amd-threadripper-pro-3995wx-review

64 Cores of Rendering Madness: The AMD Threadripper Pro 3995WX Review


by Dr. Ian Cutress on February 9, 2021 9:00 AM EST


Knowing your market is a key fundamental of product planning, marketing, and distribution. There’s no point creating a product with no market, or finding you have something amazing but offer it to the wrong sort of customers. When AMD started offering high-core count Threadripper processors, the one market that took as many as they could get was the graphics design business – visual effects companies and those focused on rendering loved the core count, the memory support, all the PCIe lanes, and the price. But if there’s one thing more performance brings, it’s the desire for even more performance. Enter Threadripper Pro.

computational graphics goes brrrrrrr

There are a number of industries where, looking in from the outside, an enthusiast might assume that using a CPU is old fashioned, and ask why that industry hasn’t moved fully to GPU accelerators. One of the big ones is machine learning – despite the push to dedicated machine learning hardware and lots of big businesses doing ML on GPUs, most machine learning today is still done on CPUs. The same is still true with graphics and visual effects.

The reason behind this typically comes down to the software packages in use, and the programmers in charge.

Developing software for CPUs is easy, because that is what most people are trained on. Optimization packages for CPUs are well established, and even for upcoming specialist instructions, these can be developed in simulated environments. A CPU is designed to handle almost anything thrown at it, even super bad code.

By contrast, GPU compute is harder. It isn’t as difficult as it used to be, as there are wide arrays of libraries that enable GPU compilation without having to know too much about how to program for a GPU, however the difficulty lies in architecting the workload to take advantage of what a GPU has to offer. A GPU is a massive engine that performs the same operation to hundreds of parallel threads at the same time – it also has a very small cache and accesses to GPU memory are long, so that latency is hidden by having even more threads in flight at once. If the compute part of the software isn’t amenable to that sort of workload, such as being structurally more linear, then spending 6 months redeveloping for a GPU is a wasted effort. Or even if the math works out better on GPU, trying to rebuild a 20-year old codebase (or older) for GPUs still requires a substantial undertaking by a group of experts.

GPU compute has come on in leaps and bounds since I did it in the late 2000s. But the fact remains that there are still a number of industries that rely on a mix of CPU and GPU throughput. These include machine learning, oil and gas, financial, medical, and the one we’re focusing on today: visual effects.

A visual effects design and rendering workload is a complex mix of dedicated software platforms and plugins. Software like Cinema4D, Blender, Maya, and others rely on the GPU to showcase a partially rendered scene for these artists to work on in real time, also relying on strong single core performance, but the bulk of compute for the final render will depend on what plugins are being used for that particular product. Some plugins are GPU accelerated, such as Blender Cycles, and the move to more GPU-accelerated workloads is taking its time – ray tracing accelerated design is an area that is getting a lot of GPU attention, for example.

There are always questions as to which method produces the best image – there’s no point using a GPU to accelerate the rendering time if it adds additional noise or reduces the quality. A film studio is more than likely to prioritize a slow higher-quality render on CPUs than a fast noisy one on GPUs, or alternatively, render a lower resolution image and then upscale with trained AI. Based on our conversations with OEMs that supply the industry, we've been told that a number of studios will outright say that rendering their workflow on a CPU is the only way they do it. The other angle is memory, as the right CPU can have 256 GB to 4 TB of DRAM available, whereas the best GPUs can only supply 80 GB (and those are the super expensive ones).

The point I’m making here is that VFX studios still prefer CPU compute, and the more the better. When AMD launched its new Zen-based processors, particularly the 32 and 64 core count models, these were immediately earmarked as potential replacements for the Xeons being used in these VFX studios. AMD’s parts prioritized FP compute, a key element in VFX design, and having double the cores per socket was also a winner, combined with the large amount of cache per core. This latter part meant that even though the first high-core count parts had a non-uniform memory architecture, it wasn’t as much of an issue as with some other compute processes.

A number of VFX companies, as far as we understand, focused on AMD’s Threadripper platform over the corresponding EPYC. When both of these parts first arrived on the market, it was very easy for VFX studios to invest in under-the-desk workstations built on Threadripper, while EPYC was more for server rack installations and not so much for workstations. Roll around to Threadripper 3000 and EPYC 7002, and now there are 64 cores, 64 PCIe 4.0 lanes, and lots of choice. VFX studios still went for Threadripper, mostly because it offered a higher 280 W TDP in something that could easily be sourced from system integrators like Armari that specialize in high-compute under-desk systems. They also asked AMD for more.

AMD has now rolled out its Threadripper Pro platform, addressing some of these requirements. While VFX is always core compute focused, the TR Pro now gives double the PCIe lanes, double the memory bandwidth, support for up to 2TB of memory, and Pro-level admin support. These PCIe lanes could be extended to local storage (always important in VFX) as well as large RAMDisks, and the admin support through DASH helps keep the company systems managed together appropriately. AMD’s Memory Guard is also in its Pro line of parts, which is designed to enable full memory encryption.

Beyond VFX, AMD has cited world leadership compute with TR Pro for product engineering with Creo, 3D visualization with KeyShot, model design in architecture with Autodesk Revit, and data science, such as oil and gas dataset analysis, where the datasets are growing into the hundreds of GB and require substantial compute support.

Threadripper Pro vs Workstation EPYC (WEPYC)

Looking at the benefits that these new processors provide, it’s clear to see that these are more Workstation-style EPYC parts than ‘enhanced’ Threadrippers. Here’s a breakdown:

AMD Zen 2 High-End Comparison
AnandTech	Threadripper	Threadripper Pro	Enterprise EPYC
Cores	32-64	12-64	8-64
1P Flagship	TR 3990X	TR Pro 3995WX	EPYC 7702P
MSRP	$3990	$5490	$4425
TDP	280 W	280 W	200 W
Base Freq	2900 MHz	2700 MHz	2000 MHz
Turbo Freq	4300 MHz	4200 MHz	3350 MHz
Socket	sTRX40	sTRX4: WRX80	SP3
L3 Cache	256 MB	256 MB	256 MB
DRAM	4 x DDR4-3200	8 x DDR4-3200	8 x DDR4-3200
DRAM Capacity	256 GB	2 TB, ECC	4 TB, ECC
PCIe	4.0 x56 + chipset	4.0 x120 + chipset	4.0 x128
Pro Features	No	Yes	Yes
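
As a rough back-of-the-envelope check on the DRAM row above, theoretical peak bandwidth is channels times transfer rate times bus width. The sketch below assumes a standard 64-bit (8-byte) DDR4 channel and ignores real-world efficiency, so the numbers are upper bounds rather than measured results.

```python
# Theoretical peak DRAM bandwidth = channels x transfer rate x bytes per transfer.
# Assumption: a standard 64-bit (8-byte) DDR4 channel; sustained bandwidth is lower.

def peak_bandwidth_gbs(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Return theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return channels * mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9

# Channel counts taken from the comparison table; all three run DDR4-3200.
platforms = {
    "TR 3990X (4 x DDR4-3200)": 4,
    "TR Pro 3995WX (8 x DDR4-3200)": 8,
    "EPYC 7702P (8 x DDR4-3200)": 8,
}

for name, channels in platforms.items():
    total = peak_bandwidth_gbs(channels, 3200)
    # Per-core figure assumes the bandwidth is shared evenly across 64 cores.
    print(f"{name}: {total:.1f} GB/s total, {total / 64:.1f} GB/s per core")
```

On these assumptions the eight-channel parts come out at 204.8 GB/s against 102.4 GB/s for quad-channel Threadripper, or 3.2 GB/s versus 1.6 GB/s per core when shared across 64 cores.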

To get these new parts starting from EPYC, all AMD had to do was raise the TDP to 280 W, and cut the DRAM support. If we start from a Threadripper base, there are 3-4 substantial changes. So why is this called Threadripper Pro, and not Workstation EPYC?

We come back to the VFX studios again. Having already bought in to the Threadripper branding and way of thinking, keeping these parts as Threadripper helps smooth that transition – this vertical had kind of already said they preferred Threadripper over EPYC, from what we are told, and so keeping the naming consistent means that there is no real re-education to do.

The other element is that the EPYC processor line is somewhat fractured: there are standard versions, high performance H models, high frequency F models, and then a series of custom designs under B, V, and others for specific customers. By keeping this new line as Threadripper Pro, it keeps it all under one umbrella.

Threadripper Pro Offerings: 12 core to 64 core

AMD announced these processors in the middle of last year, along with the Lenovo Thinkstation P620 as being the launch platform. From my experience, the Thinkstation line is very well designed, and we’re testing our 3995WX in a P620 today.

AMD Ryzen Threadripper Pro
AnandTech	Cores	Base Freq	Turbo Freq	Chiplets	L3 Cache	TDP	Price SEP
3995WX	64 / 128	2700	4200	8 + 1	256 MB	280 W	$5490
3975WX	32 / 64	3500	4200	4 + 1	128 MB	280 W	$2750
3955WX	16 / 32	3900	4300	2 + 1	64 MB	280 W	$1150
3945WX	12 / 24	4000	4300	2 + 1	64 MB	280 W	*
*Unsure if this is a special OEM model

When TR Pro was announced with Lenovo, we weren’t sure if any other OEM would have access to Threadripper Pro. When we asked OEMs earlier that year about it, before we even knew if TR Pro was a real thing, they stated that AMD hadn’t even marked the platform on their roadmap, which we reported at the time. We have since learned that Lenovo had the 6-month exclusive, and information was only supplied to other vendors (ASUS, GIGABYTE, Supermicro) after it had been announced.

To that end, AMD has since announced that Threadripper Pro is coming to retail, both for other OEMs to design systems, or for end-users to build their own. Despite using the same LGA4094 socket as the other Threadripper and EPYC processors, TR Pro will be locked down to WRX80 motherboards. We currently know of three, such as the Supermicro and GIGABYTE models, plus we have also had the ASUS Pro WS WRX80E-SAGE SE Wi-Fi model in house for a short hands-on, although we weren’t able to test it.

Of the four processors listed above, the top three are going on sale. It’s worth noting that only the 64-core comes with 256 MB of L3 cache, while the 32-core comes with 128 MB of L3. AMD has ensured that these designs only use as many chiplets as is absolutely necessary, keeping L3 cache per core consistent as well as the 8 cores per chiplet (the EPYC product line varies this a bit).

The fourth processor, the 12-core, would appear to be an OEM-only specific processor for prebuilt systems.

Threadripper Pro versus The World

These Threadripper Pro offerings are designed to compete against two segments: first is AMD itself, showing anyone who is using a high-end professional system built on first generation Zen hardware that there is a lot of performance to be had. The second is Intel workstation customers, either using single socket Xeon W (which tops out at 28 cores), or a dual socket Xeon system that costs more, uses a lot more power simply because it is dual socket, and also has a non-uniform memory architecture.

We have almost all these in this test (we don't have the 7702P, but we do have the 7742), and realistically these are the only processors that should be considered if the 3995WX is an option for you:

3995WX Comparison Offerings
AnandTech	Core	SEP	1P 2P	TDP	Base Freq	Peak Freq	DDR	PCIe	DDR Cap
TR Pro 3995WX	64C	$5490	1P	280W	2700	4200	8x3200	128x 4.0	2 TB
TR 3990X	64C	$3990	1P	280W	2900	4300	4x3200	64x 4.0	¼ TB
EPYC 7702P	64C	$4425	1P	200W	2000	3350	8x3200	128x 4.0	4 TB
EPYC 7742	64C	$6950	2P	225W	2250	3400	8x3200	128x 4.0	4 TB
Xeon 6258R	28C	$3950	2P	205W	2700	4000	6x2933	48x 3.0	1 TB
Xeon W-3175X	28C	$2999	1P	255W	3100	4300	6x2933	48x 3.0	½ TB

Intel tops out at 28 cores, and there is no getting around that. Technically Intel has the AP processor line that goes up to 56 cores, however these are for specialist systems and we haven’t had one physically sent to us for testing. Those are also $20k+ per CPU, and are two CPUs in the same system bolted under one package.

The AMD comparison points are the best Threadripper option and the best available EPYC processor, albeit the 2P version. The best comparison here would be the 7702P, the single socket variant and much more price competitive, however we haven’t got this in for testing. Instead we have AMD's EPYC 7742, which is the dual socket version but slightly higher performance.

Test Setup
AMD TR Pro	TR Pro 3995WX	Lenovo Thinkstation P620	BIOS S07K T0EA	Lenovo Custom	Kingston 8x16 GB DDR4-3200 ECC
AMD TR	TR 3990X	MSI Creator TRX40	BIOS 1.50	Thermaltake 280mm AIO	Corsair 4x8 GB DDR4-3200
AMD EPYC	EPYC 7742	Supermicro H11DSI	BIOS 2.1	Noctua NH-U14S TR4-SP3	SK Hynix 16x32 GB DDR4-3200 ECC
Intel Xeon	Xeon Gold 6258R	ASUS ROG Dominus Extreme	BIOS 0601	Asetek 690LX-PN	SK Hynix 6x32 GB DDR4-2933 ECC
Intel Xeon	Xeon W-3175X	ASUS ROG Dominus Extreme	BIOS 0601	Asetek 690LX-PN	DDR4-2666 ECC
GPU	Sapphire RX 460 2GB (CPU Tests)
PSU	Various (inc. Corsair AX860i)
SSD	Crucial MX500 2TB
Silverstone SST-FHP141-VF 173 CFM fans also used. Nice and loud.

We must thank the following companies for kindly providing hardware for our multiple test beds. Some of this hardware is not in this test bed specifically, but is used in other testing.

Hardware Providers for CPU and Motherboard Reviews
Sapphire RX 460 Nitro	NVIDIA RTX 2080 Ti	Crucial SSDs	Corsair PSUs

Kingston DDR4 RDIMM	ADATA DDR4	Silverstone Coolers	Noctua Coolers

Users interested in the details of our current CPU benchmark suite can refer to our #CPUOverload article which covers the topics of benchmark automation as well as what our suite runs and why. We also benchmark much more data than is shown in a typical review, all of which you can see in our benchmark database. We call it ‘Bench’, and there’s also a link on the top of the website in case you need it for processor comparison in the future.

CPU Tests: Microbenchmarks

A y-Cruncher Sprint

The y-cruncher website has a large amount of benchmark data showing how different CPUs perform when calculating pi up to a given number of digits. Not only are the pi world records present, but below these there are a few CPUs showing the scaling of the hardware, where it shows the time to compute moving from 25 million digits to 50 million, 100 million, 250 million, and all the way up to 10 billion, to showcase how the performance scales with digits (assuming everything is in memory). This range of results, from 25 million to 250 billion, is something I’ve dubbed a ‘sprint’.

I have written some code in order to perform a sprint on every CPU we test. It detects the DRAM, works out the biggest value that can be calculated with that amount of memory, and works up starting from 25 million digits. For the tests that go up to the ~25 billion digits, it only adds an extra 15 minutes to the suite for an 8-core Ryzen CPU. With this test, we can see the effect of increasing memory requirements on the workload and the scaling factor for a workload such as this.
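
The sizing logic can be sketched roughly as follows. This is not the actual test code: the `bytes_per_digit` figure is a placeholder assumption (y-cruncher's real memory requirement varies with size and mode), and detecting the installed DRAM is left to the caller.

```python
def sprint_sizes(dram_bytes, bytes_per_digit=5.0):
    """Return the digit counts 25M, 50M, 100M, 250M, ... that fit in memory.

    bytes_per_digit is a hypothetical working-set estimate, not a y-cruncher figure.
    """
    sizes = []
    scale = 1_000_000
    while True:
        # Each decade contributes 25x, 50x and 100x the current scale.
        for mantissa in (25, 50, 100):
            digits = mantissa * scale
            if digits * bytes_per_digit > dram_bytes:
                return sizes
            sizes.append(digits)
        scale *= 10

# Example: a system with 128 GiB of DRAM (the caller supplies this value).
for digits in sprint_sizes(128 * 2**30):
    print(f"{digits:,} digits")
```

With the placeholder estimate, a 128 GiB system would stop at 25 billion digits; a different memory model would move that cut-off.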

Longer lines indicate more memory installed in the system at the time

For this sprint, we’ve converted each result into how many million digits are calculated per second at each of the dataset sizes. The more cores a system has, the better the compute, and Intel gets an AVX-512 bonus here as well because the software can use AVX-512. But as the dataset gets larger, there is more shuffling of values back and forth between memory and cache, so being able to keep a high bandwidth while also keeping a low latency to all cores is crucial in this test, especially as the test size increases.

The 8-channel 64-core TR Pro 3995WX here does very well, peaking at around 80 million per second, and at the end of the test still being very fast. It sits above the EPYC 7742 here due to the fact that it has a higher TDP and frequency. They are both well above the Threadripper 3990X, which only has quad-channel memory, which is the reason for the decrease as the dataset increases.

The W-3175X from Intel has the AVX-512 advantage, which is why the 28 cores can compete with the 64 cores from AMD, however the six-channel memory bandwidth and probably the mesh quickly become a bottleneck as each core needs to feed those AVX-512 units. This is the sort of situation where in-package HBM is likely to make a big difference. But at the smaller dataset sizes at least the W-3175X can feed enough data across the mesh to the AVX-512 units for the peak throughput.

Core-to-Core Latency

As the core count of modern CPUs is growing, we are reaching a time when the time to access each core from a different core is no longer a constant. Even before the advent of heterogeneous SoC designs, processors built on large rings or meshes can have different latencies to access the nearest core compared to the furthest core. This rings true especially in multi-socket server environments.

But modern CPUs, even desktop and consumer CPUs, can have variable access latency to get to another core. For example, in the first generation Threadripper CPUs, we had four chips on the package, each with 8 threads, and each with a different core-to-core latency depending on if it was on-die or off-die. This gets more complex with products like Lakefield, which has two different communication buses depending on which core is talking to which.

If you are a regular reader of AnandTech’s CPU reviews, you will recognize our Core-to-Core latency test. It’s a great way to show exactly how groups of cores are laid out on the silicon. This is a custom in-house test built by Andrei, and we know there are competing tests out there, but we feel ours is the most accurate to how quick an access between two cores can happen.

Due to a test limitation, we’re only probing the first 64 threads of the system, but the scale out to 128 threads would be identical. This generation of Threadripper Pro is built on Zen 2, similar to Threadripper 3990X and the EPYC 7742, and so we only have quad-core CCXes in play here. A thread speaking to itself has a latency of around 7 nanoseconds, inside a quad-core CCX is around 18-19 nanoseconds, and then accessing any other core varies from 77-89 nanoseconds. Even accessing the other CCX on the same chiplet has the same latency, as the communication is designed to ping out to the central IO die first. If Threadripper Pro gets boosted to Zen 3 for the next generation, this will be a big uplift as we’ve already seen with Zen 3. But TR Pro with Zen 3 might only be launched when Zen 4 comes out, and we’ll be talking about that difference when that happens.
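
To make the three latency tiers concrete, here is a toy model of the Zen 2 layout described above. It assumes physical cores are numbered contiguously by CCX (four per CCX) and ignores SMT, which is a simplification; the nanosecond ranges are simply the figures quoted in the text, not new measurements.

```python
from collections import Counter
from itertools import combinations

CORES_PER_CCX = 4  # Zen 2 groups cores into quad-core CCXes

# Latency ranges in nanoseconds, taken from the figures quoted in the text.
TIERS = {
    "same core": (7, 7),
    "same CCX": (18, 19),
    "other CCX (via IO die)": (77, 89),
}

def latency_tier(core_a, core_b):
    """Classify a core pair, assuming contiguous core numbering per CCX."""
    if core_a == core_b:
        return "same core"
    if core_a // CORES_PER_CCX == core_b // CORES_PER_CCX:
        return "same CCX"
    # Covers the other CCX on the same chiplet too: traffic goes via the IO die.
    return "other CCX (via IO die)"

# Count how many distinct core pairs on a 64-core part fall into each tier.
pair_counts = Counter(latency_tier(a, b) for a, b in combinations(range(64), 2))
for tier, count in pair_counts.items():
    low, high = TIERS[tier]
    print(f"{tier}: {count} core pairs, ~{low}-{high} ns")
```

Under this model only 96 of the 2016 possible core pairs share a CCX, which is why thread placement matters for latency-sensitive workloads on these parts.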

Frequency Ramping

Both AMD and Intel over the past few years have introduced features to their processors that speed up the time from when a CPU moves from idle into a high powered state. The effect of this means that users can get peak performance quicker, but the biggest knock-on effect for this is with battery life in mobile devices, especially if a system can turbo up quick and turbo down quick, ensuring that it stays in the lowest and most efficient power state for as long as possible.

Intel’s technology is called SpeedShift, although SpeedShift was not enabled until Skylake.

One of the issues though with this technology is that sometimes the adjustments in frequency can be so fast, software cannot detect them. If the frequency is changing on the order of microseconds, but your software is only probing frequency in milliseconds (or seconds), then quick changes will be missed. Not only that, as an observer probing the frequency, you could be affecting the actual turbo performance. When the CPU is changing frequency, it essentially has to pause all compute while it aligns the frequency rate of the whole core.
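
The sampling problem is easy to show with a made-up trace. The sketch below is purely synthetic: the idle and peak values are borrowed from this chip's base and turbo ratings, but the ramp shape, its start time, and its length are invented for illustration and are not measured data.

```python
# Synthetic illustration only: this trace is invented, not measured.
IDLE_MHZ = 2700
PEAK_MHZ = 4200
STEP_MHZ = 25
RAMP_START_US = 500     # hypothetical moment the load begins
RAMP_LENGTH_US = 1000   # hypothetical ramp lasting about one millisecond

def synthetic_frequency_mhz(t_us):
    """Frequency at time t_us for a made-up stepped ramp from idle to peak."""
    elapsed = t_us - RAMP_START_US
    if elapsed < 0:
        return IDLE_MHZ
    if elapsed >= RAMP_LENGTH_US:
        return PEAK_MHZ
    total_steps = (PEAK_MHZ - IDLE_MHZ) // STEP_MHZ
    return IDLE_MHZ + STEP_MHZ * (elapsed * total_steps // RAMP_LENGTH_US)

def observed_levels(interval_us, duration_us=20_000):
    """Distinct frequency levels seen when polling every interval_us microseconds."""
    samples = {synthetic_frequency_mhz(t) for t in range(0, duration_us, interval_us)}
    return sorted(samples)

for interval in (1, 1_000, 10_000):
    levels = observed_levels(interval)
    print(f"poll every {interval:>6} us: {len(levels)} distinct levels seen")
```

Polling every microsecond captures every step of the synthetic ramp, while polling every 10 milliseconds only ever sees the idle and peak values, so the ramp itself is invisible.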

We wrote an extensive review analysis piece on this, called ‘Reaching for Turbo: Aligning Perception with AMD’s Frequency Metrics’, due to an issue where users were not observing the peak turbo speeds for AMD’s processors.

We got around the issue by making the frequency probing the workload causing the turbo. The software is able to detect frequency adjustments on a microsecond scale, so we can see how well a system can get to those boost frequencies. Our Frequency Ramp tool has already been in use in a number of reviews.

The frequency ramp here is around one millisecond, indicative of AMD implementing its CPPC2 management design.

Power Consumption

The nature of reporting processor power consumption has become, in part, a dystopian nightmare. Historically the peak power consumption of a processor, as purchased, is given by its Thermal Design Power (TDP, or PL1). For many markets, such as embedded processors, that value of TDP still signifies the peak power consumption. For the processors we test at AnandTech, either desktop, notebook, or enterprise, this is not always the case.

Modern high performance processors implement a feature called Turbo. This allows, usually for a limited time, a processor to go beyond its rated frequency. Exactly how far the processor goes depends on a few factors, such as the Turbo Power Limit (PL2), whether the peak frequency is hard coded, the thermals, and the power delivery. Turbo can sometimes be very aggressive, allowing power values 2.5x above the rated TDP.

AMD and Intel have different definitions for TDP, but broadly speaking they are applied in the same way. The difference comes down to turbo modes, turbo limits, turbo budgets, and how the processors manage that power balance. These topics are 10000-12000 word articles in their own right, and we’ve got a few articles worth reading on the topic.

In simple terms, processor manufacturers only ever guarantee two values which are tied together - when all cores are running at base frequency, the processor should be running at or below the TDP rating. All turbo modes and power modes above that are not covered by warranty. Intel kind of screwed this up with the Tiger Lake launch in September 2020, by refusing to define a TDP rating for its new processors, instead going for a range. Obfuscation like this is a frustrating endeavor for press and end-users alike.

However, for our tests in this review, we measure the power consumption of the processor in a variety of different scenarios. These include full AVX2 or AVX-512 workloads (depending on what the processor supports), real-world image-model construction, and others as appropriate. These tests are done as comparative models. We also note the peak power recorded in any of our tests.

AMD Ryzen Threadripper Pro 3995WX

The specifications for this processor list 64 cores running at a TDP of 280 W. In our testing, we never saw any power consumption over 280 W:

(0-0) Peak Power

Going through our POV-Ray scaling power test for per-core consumption, we’re seeing a trend whereby 40% of the power goes to the non-core operation of the system, which is also likely to include the L3 cache.

Red = Full Package, Blue = CPU Core only (minus L3 we think)

We only hit the peak 280 W when we are at 56-core loading, otherwise it is a steady climb moving from 7 W/core in the early loading down to about 3 W/core when fully loaded. What this does for core frequencies is relatively interesting.

Our system starts around 4200 MHz, which is the rated turbo frequency, settling down to 4000-4050 MHz in that 8-core to 20-core loading. After 20 cores, it’s a slow decline at a rate of 25 MHz per extra core loaded, until at full CPU load we observe 3100 MHz on all cores. This is above the 2700 MHz base frequency, but also comes out to 2.86 W per core in CPU-only power, or 4.37 W per core if we also include non-CPU power. Note that non-CPU power in this case might also include the L3.
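
The per-core figures above follow from simple arithmetic on the full-load numbers. The sketch below uses only values quoted in the text (280 W package, 64 cores, 2.86 W per core for the cores alone); the function name and dictionary keys are just illustrative.

```python
def power_split(package_w, cores, core_only_w_per_core):
    """Split package power into core and non-core parts at full load."""
    core_total_w = cores * core_only_w_per_core
    non_core_w = package_w - core_total_w
    return {
        "core_total_w": core_total_w,
        "non_core_w": non_core_w,
        "non_core_share": non_core_w / package_w,
        "package_w_per_core": package_w / cores,
    }

# Full-load figures quoted in the text.
split = power_split(package_w=280.0, cores=64, core_only_w_per_core=2.86)
for key, value in split.items():
    print(f"{key}: {value:.3f}")
```

That works out to about 183 W in the cores and about 97 W elsewhere, roughly 35% of the package at full load, and 4.375 W per core overall, which lines up with the 4.37 W figure quoted.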

For an actual workload, our 3DPMavx test is a bit more aggressive than POV-Ray, cycling to full load for ten seconds for each of its six algorithms then idling for a short time. In this test we saw idle frequencies of 2700 MHz, but all-core loading was at least 2900 MHz up to 3200 MHz. Power again was very much limited to 280 W.

CPU Tests: Rendering

Rendering tests, compared to others, are often a little simpler to digest and automate. All the tests put out some sort of score or time, usually in an obtainable way that makes it fairly easy to extract. These tests are some of the most strenuous in our list, due to the highly threaded nature of rendering and ray-tracing, and can draw a lot of power. If a system is not properly configured to deal with the thermal requirements of the processor, the rendering benchmarks are where it would show most easily as the frequency drops over a sustained period of time. Most benchmarks in this case are re-run several times, and the key to this is having an appropriate idle/wait time between benchmarks to allow for temperatures to normalize from the last test.

Blender 2.83 LTS

One of the popular tools for rendering is Blender, with it being a public open source project that anyone in the animation industry can get involved in. This extends to conferences, use in films and VR, with a dedicated Blender Institute, and everything you might expect from a professional software package (except perhaps a professional grade support package). With it being open-source, studios can customize it in as many ways as they need to get the results they require. It ends up being a big optimization target for both Intel and AMD in this regard.

For benchmarking purposes, we fell back to rendering a single frame from a detailed project. Most reviews, as we have done in the past, focus on one of the classic Blender renders, known as BMW_27. It can take anywhere from a few minutes to almost an hour on a regular system. However now that Blender has moved onto a Long Term Support model (LTS) with the latest 2.83 release, we decided to go for something different.

We use this scene, called PartyTug at 6AM by Ian Hubert, which is the official image of Blender 2.83. It is 44.3 MB in size, and uses some of the more modern compute properties of Blender. As it is more complex than the BMW scene, but uses different aspects of the compute model, time to process is roughly similar to before. We loop the scene for at least 10 minutes, taking the average time of the completed renders. Blender offers a command-line tool for batch commands, and we redirect the output into a text file.
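
A loop of this kind might look something like the sketch below. It is not our actual harness: the scene path and log file name are hypothetical, and the only Blender options used are the standard `-b` (run in background) and `-f` (render a frame) command-line flags.

```python
import statistics
import subprocess
import time

def loop_benchmark(run_once, min_seconds=600.0):
    """Repeat run_once until at least min_seconds have elapsed.

    Returns (average seconds per completed run, number of runs).
    """
    durations = []
    start = time.perf_counter()
    while True:
        t0 = time.perf_counter()
        run_once()
        durations.append(time.perf_counter() - t0)
        # Always finish the run in progress, then check the total elapsed time.
        if time.perf_counter() - start >= min_seconds:
            break
    return statistics.mean(durations), len(durations)

def render_frame(scene="scene.blend", log_path="blender_log.txt"):
    """Render frame 1 headlessly; scene and log_path are hypothetical names."""
    with open(log_path, "a") as log:
        subprocess.run(["blender", "-b", scene, "-f", "1"], stdout=log, check=True)
```

Calling `loop_benchmark(render_frame)` would then render the scene repeatedly for at least ten minutes and report the mean render time, with Blender's console output appended to the log file.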

(4-1) Blender 2.83 Custom Render Test

The first benchmark out of the gate and a win is scored by the TR Pro. It's only a small win at around 3%, but it showcases that the eight memory channels outweigh the extra frequency of regular Threadripper here.

Corona 1.3

Corona is billed as a popular high-performance photorealistic rendering engine for 3ds Max, with development for Cinema 4D support as well. In order to promote the software, the developers produced a downloadable benchmark on the 1.3 version of the software, with a ray-traced scene involving a military vehicle and a lot of foliage. The software does multiple passes, calculating the scene, geometry, preconditioning and rendering, with performance measured in the time to finish the benchmark (the official metric used on their website) or in rays per second (the metric we use to offer a more linear scale).

The standard benchmark provided by Corona is interface driven: the scene is calculated and displayed in front of the user, with the ability to upload the result to their online database. We got in contact with the developers, who provided us with a non-interface version that allowed for command-line entry and retrieval of the results very easily. We loop around the benchmark five times, waiting 60 seconds between each, and taking an overall average. The time to run this benchmark can be around 10 minutes on a Core i9, up to over an hour on a quad-core 2014 AMD processor or dual-core Pentium.

(4-2) Corona 1.3 Benchmark

Another small 3% win in Corona, and almost double what Intel is offering.

Crysis CPU-Only Gameplay

One of the most oft used memes in computer gaming is ‘Can It Run Crysis?’. The original 2007 game, built in the Crytek engine by Crytek, was heralded as a computationally complex title for the hardware at the time and several years after, suggesting that a user needed graphics hardware from the future in order to run it. Fast forward over a decade, and the game runs fairly easily on modern GPUs.

But can we also apply the same concept to pure CPU rendering? Can a CPU, on its own, render Crysis? Since 64 core processors entered the market, one can dream. So we built a benchmark to see whether the hardware can.

For this test, we’re running Crysis’ own GPU benchmark, but in CPU render mode. This is a 2000 frame test, with medium and low settings.

(4-3a) Crysis CPU Render at 320x200 Low (4-3b) Crysis CPU Render at 1080p Low

The Crytek engine used for Crysis has two key limitations: up to 32 threads, and up to 23 cores. As a result our high-end CPUs here are pegged to those cores, and single-thread limits come into play. The TR 3995WX has a healthy lead over the EPYC 7742, but loses out against the mainstream processors.

POV-Ray 3.7.1

A long time benchmark staple, POV-Ray is another rendering program that is well known to load up every single thread in a system, regardless of cache and memory levels. After a long period of POV-Ray 3.7 being the latest official release, when AMD launched Ryzen the POV-Ray codebase suddenly saw a range of activity from both AMD and Intel, knowing that the software (with the built-in benchmark) would be an optimization tool for the hardware.

We had to stick a flag in the sand when it came to selecting the version that was fair to both AMD and Intel, and still relevant to end-users. Version 3.7.1 fixes a significant bug in the early 2017 code, a write-after-read pattern that is advised against in both Intel and AMD manuals, leading to a nice performance boost.

The benchmark can take over 20 minutes on a slow system with few cores, or around a minute or two on a fast system, or seconds with a dual high-core count EPYC. Because POV-Ray draws a large amount of power and current, it is important to make sure the cooling is sufficient here and the system stays in its high-power state. Using a motherboard with a poor power-delivery and low airflow could create an issue that won’t be obvious in some CPU positioning if the power limit only causes a 100 MHz drop as it changes P-states.

(4-4) POV-Ray 3.7.1

POV-Ray is another 3% win for the TR Pro 3995WX.

V-Ray

We have a couple of renderers and ray tracers in our suite already, however V-Ray’s benchmark was requested often enough for us to roll it into our suite. Built by ChaosGroup, V-Ray is a 3D rendering package compatible with a number of popular commercial imaging applications, such as 3ds Max, Maya, Unreal, Cinema 4D, and Blender.

We run the standard standalone benchmark application, but in an automated fashion to pull out the result in the form of kilosamples/second. We run the test six times and take an average of the valid results.

(4-5) V-Ray Renderer

Less than 0.3% win here this time, but AMD's 64-core offerings are still ahead of the game.

Cinebench R20

Another common staple of a benchmark suite is Cinebench. Based on Cinema4D, Cinebench is a purpose built benchmark that renders a scene with both single and multi-threaded options. The scene is identical in both cases. The R20 version means that it targets Cinema 4D R20, a slightly older version of the software which is currently on version R21. Cinebench R20 was launched given that the R15 version had been out a long time, and despite the difference between the benchmark and the latest version of the software on which it is based, Cinebench results are often quoted in marketing materials.

Results for Cinebench R20 are not comparable to R15 or older, both because the scene being used is different and because of updates in the code path. The results are output as a score from the software, which is inversely proportional to the time taken. Using the benchmark flags for single CPU and multi-CPU workloads, we run the software from the command line which opens the test, runs it, and dumps the result into the console which is redirected to a text file. The test is repeated for a minimum of 10 minutes for both ST and MT, and then the runs averaged.

(4-6a) CineBench R20 Single Thread (4-6b) CineBench R20 Multi-Thread

Cinebench ST scores are clearly in the realm of the mainstream processors, but when firing up all the threads, the TR Pro 3995WX takes a 4% lead over the standard 3990X, and scores 2.33x more than the mainstream 16-core Ryzen 9.

CPU Tests: Encoding

One of the interesting elements on modern processors is encoding performance. This covers two main areas: encryption/decryption for secure data transfer, and video transcoding from one video format to another.

In the encrypt/decrypt scenario, how data is transferred and by what mechanism is pertinent to on-the-fly encryption of sensitive data - a process that more modern devices are leaning on for software security.

Video transcoding as a tool to adjust the quality, file size and resolution of a video file has boomed in recent years, such as providing the optimum video for devices before consumption, or for game streamers who are wanting to upload the output from their video camera in real-time. As we move into live 3D video, this task will only get more strenuous, and it turns out that the performance of certain algorithms is a function of the input/output of the content.

HandBrake 1.3.2: Link

Video transcoding (both encode and decode) is a hot topic in performance metrics as more and more content is being created. The first consideration is the standard in which the video is encoded, which can be lossless or lossy, trade performance for file-size, trade quality for file-size, or all of the above, and can increase encoding rates to help accelerate decoding rates. Alongside Google's favorite codecs, VP9 and AV1, there are others that are prominent: H264, the older codec, is practically everywhere and is designed to be optimized for 1080p video, and HEVC (or H.265), which aims to provide the same quality as H264 but at a lower file-size (or better quality for the same size). HEVC is important as 4K is streamed over the air, meaning fewer bits need to be transferred for the same quality content. There are other codecs coming to market designed for specific use cases all the time.

Handbrake is a favored tool for transcoding, with the later versions using copious amounts of newer APIs to take advantage of co-processors, like GPUs. It is available on Windows via an interface or can be accessed through the command-line, with the latter making our testing easier, with a redirection operator for the console output.

We take the compiled version of this 16-minute YouTube video about Russian CPUs at 1080p30 h264 and convert into three different files: (1) 480p30 ‘Discord’, (2) 720p30 ‘YouTube’, and (3) 4K60 HEVC.

(5-1a) Handbrake 1.3.2, 1080p30 H264 to 480p Discord (5-1b) Handbrake 1.3.2, 1080p30 H264 to 720p YouTube (5-1c) Handbrake 1.3.2, 1080p30 H264 to 4K60 HEVC

For the lower resolution modes, it would appear that the increased memory bandwidth plays a role for the 3995WX and 7742, although single core frequency also means a lot. Moving to the HEVC metrics, the Ryzen 9 takes a win here, but the 3995WX still goes above the 3990X.

7-Zip 1900: Link

The first compression benchmark tool we use is the open-source 7-zip, which typically offers good scaling across multiple cores. 7-zip is the compression tool most cited by readers as one they would rather see benchmarks on, and the program includes a built-in benchmark tool for both compression and decompression.

Example Test Run on an Intel 10-core i7-6950X

The tool can either be run from inside the software or through the command line. We take the latter route as it is easier to automate, obtain results, and put through our process. The command line flags available offer an option for repeated runs, and the output provides the average automatically through the console. We direct this output into a text file and regex the required values for compression, decompression, and a combined score.
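The "redirect to a text file, then regex the values" step might look something like the sketch below. Treat it as an assumption-laden example rather than our actual script: the layout of the `Avr:` and `Tot:` summary lines in `SAMPLE_OUTPUT` is assumed, the numbers in it are made up for illustration, and `parse_7zip_summary` is a hypothetical helper name.

```python
import re

# Hypothetical excerpt of the benchmark summary. The column layout is an
# assumption, and the values are illustrative, not measured results.
SAMPLE_OUTPUT = """\
Avr:             786   3146  24729  |              794   3094  24577
Tot:             790   3120  24653
"""

# Third number on each side of the '|' is taken as the rating.
AVR_PATTERN = re.compile(
    r"^Avr:\s+\d+\s+\d+\s+(\d+)\s+\|\s+\d+\s+\d+\s+(\d+)", re.MULTILINE
)
TOT_PATTERN = re.compile(r"^Tot:\s+\d+\s+\d+\s+(\d+)", re.MULTILINE)


def parse_7zip_summary(text):
    """Extract compression, decompression, and combined ratings."""
    avr = AVR_PATTERN.search(text)
    tot = TOT_PATTERN.search(text)
    if avr is None or tot is None:
        raise ValueError("summary lines not found in benchmark output")
    return {
        "compress": int(avr.group(1)),
        "decompress": int(avr.group(2)),
        "combined": int(tot.group(1)),
    }


if __name__ == "__main__":
    print(parse_7zip_summary(SAMPLE_OUTPUT))
```

Raising an error when the summary lines are missing makes a failed or truncated run obvious instead of silently recording a zero.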

(5-2c) 7-Zip 1900 Combined Score

This is a 16.6% win for the TR Pro 3995WX.

AES Encoding

Algorithms using AES coding have spread far and wide as a ubiquitous tool for encryption. Again, this is another CPU limited test, and modern CPUs have special AES pathways to accelerate their performance. We often see scaling in both frequency and cores with this benchmark. We use the latest version of TrueCrypt and run its benchmark mode over 1GB of in-DRAM data. Results shown are the GB/s average of encryption and decryption.

(5-3) AES Encoding

WinRAR 5.90: Link

For the 2020 test suite, we move to the latest version of WinRAR in our compression test. WinRAR in some quarters is more user friendly than 7-Zip, hence its inclusion. Rather than use a benchmark mode as we did with 7-Zip, here we take a set of files representative of a generic stack:

33 video files, each 30 seconds, in 1.37 GB,
2834 smaller website files in 370 folders in 150 MB,
100 Beat Saber music tracks and input files, for 451 MB

This is a mixture of compressible and incompressible formats. The results shown are the time taken to encode the file. Due to DRAM caching, we run the test for 20 minutes and take the average of the last five runs when the benchmark is in a steady state.

For automation, we use AHK’s internal timing tools from initiating the workload until the window closes signifying the end. This means the results are contained within AHK, with an average of the last 5 results being easy enough to calculate.
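The averaging itself is simple enough to show in a few lines. The Python sketch below is only an illustration of the "average of the last five runs" idea (our automation lives in AHK, not Python); `steady_state_average` and the sample timings are hypothetical.

```python
from statistics import mean


def steady_state_average(run_times, keep_last=5):
    """Average the final keep_last timings, ignoring earlier warm-up runs.

    Early runs can differ from later ones because of DRAM caching,
    so only the tail of the series is treated as steady state.
    """
    if len(run_times) < keep_last:
        raise ValueError("not enough runs to reach a steady state")
    return mean(run_times[-keep_last:])


if __name__ == "__main__":
    # Illustrative timings in seconds, not measured data.
    timings = [40.0, 35.0, 31.0, 30.0, 30.0, 30.0, 30.0, 30.0]
    print(steady_state_average(timings))
```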

(5-4) WinRAR 5.90 Test, 3477 files, 1.96 GB

WinRAR is variable threaded, but the Xeon Gold takes the win here - even compared to the Xeon W-3175X. It's all relatively close at the top end.

CPU Tests: Office and Science

Our previous sets of ‘office’ benchmarks have often been a mix of science and synthetics, so this time we wanted to keep our office section purely on real world performance.

Agisoft Photoscan 1.3.3: link

The concept of Photoscan is about translating many 2D images into a 3D model - so the more detailed the images, and the more you have, the better the final 3D model in both spatial accuracy and texturing accuracy. The algorithm has four stages, with some parts of the stages being single-threaded and others multi-threaded, along with some cache/memory dependency in there as well. For some of the more variable threaded workload, features such as Speed Shift and XFR will be able to take advantage of CPU stalls or downtime, giving sizeable speedups on newer microarchitectures.

For the update to version 1.3.3, the Agisoft software now supports command line operation. Agisoft provided us with a set of new images for this version of the test, and a python script to run it. We’ve modified the script slightly by changing some quality settings for the sake of the benchmark suite length, as well as adjusting how the final timing data is recorded. The python script dumps the results file in the format of our choosing. For our test we obtain the time for each stage of the benchmark, as well as the overall time.

(1-1) Agisoft Photoscan 1.3, Complex Test

The high core count and high memory bandwidth put the wins onto AMD here, and the 3995WX is +13.3% faster compared to the standard Threadripper. The difference back to EPYC is +28.5%.

Application Opening: GIMP 2.10.18

First up is a test using a monstrous multi-layered xcf file to load GIMP. While the file is only a single ‘image’, it has so many high-quality layers embedded it was taking north of 15 seconds to open and to gain control on the mid-range notebook I was using at the time.

What we test here is the first run - normally on the first time a user loads the GIMP package from a fresh install, the system has to configure a few dozen files that remain optimized on subsequent opening. For our test we delete those configured optimized files in order to force a ‘fresh load’ each time the software is run. As it turns out, GIMP does optimizations for every CPU thread in the system, which means that higher thread-count processors take a lot longer to run.

We measure the time taken from calling the software to be opened, and until the software hands itself back over to the OS for user control. The test is repeated for a minimum of ten minutes or at least 15 loops, whichever comes first, with the first three results discarded.

(1-2) AppTimer: GIMP 2.10.18

Our GIMP test here scales out with core count, so a 64C processor has 4x the work of a 16C processor. That means the smaller core-count parts take the win.

Science

In this version of our test suite, all the science focused tests that aren’t ‘simulation’ work are now in our science section. This includes Brownian Motion, calculating digits of Pi, molecular dynamics, and for the first time, we’re trialing an artificial intelligence benchmark, both inference and training, that works under Windows using python and TensorFlow. Where possible these benchmarks have been optimized with the latest in vector instructions, except for the AI test – we were told that while it uses Intel’s Math Kernel Libraries, they’re optimized more for Linux than for Windows, and so it gives an interesting result when unoptimized software is used.

3D Particle Movement v2.1: Non-AVX and AVX2/AVX512

This is the latest version of this benchmark designed to simulate semi-optimized scientific algorithms taken directly from my doctorate thesis. This involves randomly moving particles in a 3D space using a set of algorithms that define random movement. Version 2.1 improves over 2.0 by passing the main particle structs by reference rather than by value, and decreasing the amount of double->float->double recasts the compiler was adding in.

The initial version of v2.1 is a custom C++ binary of my own code, and flags are in place to allow for multiple loops of the code with a custom benchmark length. By default this version runs six times and outputs the average score to the console, which we capture with a redirection operator that writes to file.

For v2.1, we also have a fully optimized AVX2/AVX512 version, which uses intrinsics to get the best performance out of the software. This was done by a former Intel AVX-512 engineer who now works elsewhere. According to Jim Keller, there are only a couple dozen or so people who understand how to extract the best performance out of a CPU, and this guy is one of them. To keep things honest, AMD also has a copy of the code, but has not proposed any changes.

The 3DPM test is set to output millions of movements per second, rather than time to complete a fixed number of movements.
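To make the metric concrete, here is a small Python sketch of a random-walk loop scored in millions of movements per second. It is emphatically not the 3DPM code: the real benchmark is a C++ binary with its own set of movement algorithms, whereas this example assumes one generic method (a unit step in a uniformly random direction), and all function names are hypothetical.

```python
import math
import random
import time


def random_unit_step(rng):
    """Return a unit-length step in a uniformly random 3D direction."""
    z = rng.uniform(-1.0, 1.0)
    phi = rng.uniform(0.0, 2.0 * math.pi)
    r = math.sqrt(1.0 - z * z)
    return r * math.cos(phi), r * math.sin(phi), z


def move_particles(n_particles, n_steps, seed=0):
    """Move every particle n_steps times, starting from the origin."""
    rng = random.Random(seed)
    positions = [[0.0, 0.0, 0.0] for _ in range(n_particles)]
    for _ in range(n_steps):
        for p in positions:  # each position list is updated in place
            dx, dy, dz = random_unit_step(rng)
            p[0] += dx
            p[1] += dy
            p[2] += dz
    return positions


def million_moves_per_second(n_particles, n_steps, elapsed_seconds):
    """Score as a rate, rather than as time for a fixed amount of work."""
    return n_particles * n_steps / elapsed_seconds / 1e6


if __name__ == "__main__":
    start = time.perf_counter()
    move_particles(1000, 100)
    elapsed = time.perf_counter() - start
    print(f"{million_moves_per_second(1000, 100, elapsed):.3f} Mmoves/s")
```

Reporting a rate means a higher number is better, and the result does not depend on how long the benchmark was configured to run.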

(2-1) 3D Particle Movement v2.1 (non-AVX) (2-2) 3D Particle Movement v2.1 (Peak AVX)

Over the EPYC 7742 we see a +15.6% in AVX mode, but a +25.2% gain in non-AVX mode. The Intel CPUs have AVX-512 which is why they sprint off in the peak AVX test.

y-Cruncher 0.78.9506: www.numberworld.org/y-cruncher

If you ask anyone what sort of computer holds the world record for calculating the most digits of pi, I can guarantee that a good portion of those answers might point to some colossus super computer built into a mountain by a super-villain. Fortunately nothing could be further from the truth – the computer with the record is a quad socket Ivy Bridge server with 300 TB of storage. The software that was run to get that was y-cruncher.

Built by Alex Yee over the last part of a decade and some more, y-Cruncher is the software of choice for calculating billions and trillions of digits of the most popular mathematical constants. The software has held the world record for Pi since August 2010, and has broken the record a total of 7 times since. It also holds records for e, the Golden Ratio, and others. According to Alex, the program runs around 500,000 lines of code, and he has multiple binaries each optimized for different families of processors, such as Zen, Ice Lake, Skylake, all the way back to Nehalem, using the latest SSE/AVX2/AVX512 instructions where they fit in, and then further optimized for how each core is built.

For our purposes, we’re calculating Pi, as it is more compute bound than memory bound. In single thread mode we calculate 250 million digits, while in multithreaded mode we go for 2.5 billion digits. That 2.5 billion digit value requires ~12 GB of DRAM, and so is limited to systems with at least 16 GB.

(2-4b) yCruncher 0.78.9506 MT (250m Pi)

(2-4) yCruncher 0.78.9506 MT (2.5b Pi)

Further to the y-Cruncher sprint earlier in the review, our test here shows advantages for the systems with more memory channels as well as good mesh frequencies.

NAMD 2.13 (ApoA1): Molecular Dynamics

One of the popular science fields is modeling the dynamics of proteins. By looking at how the energy of active sites within a large protein structure changes over time, scientists behind the research can calculate required activation energies for potential interactions. This becomes very important in drug discovery. Molecular dynamics also plays a large role in protein folding, and in understanding what happens when proteins misfold, and what can be done to prevent it. Two of the most popular molecular dynamics packages in use today are NAMD and GROMACS.

NAMD, or Nanoscale Molecular Dynamics, has already been used in extensive Coronavirus research on the Frontera supercomputer. Typical simulations using the package are measured in how many nanoseconds per day can be calculated with the given hardware, and the ApoA1 protein (92,224 atoms) has been the standard model for molecular dynamics simulation.

Luckily the compute can home in on a typical ‘nanoseconds-per-day’ rate after only 60 seconds of simulation, however we stretch that out to 10 minutes to take a more sustained value, as by that time most turbo limits should be surpassed. The simulation itself works with 2 femtosecond timesteps. We use version 2.13 as this was the recommended version at the time of integrating this benchmark into our suite. The latest nightly builds we’re aware of have started to enable support for AVX-512, however for consistency in our benchmark suite, we are staying with 2.13. Other software that we test with has AVX-512 acceleration.
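For readers unfamiliar with the unit, the relationship between timestep rate and nanoseconds per day is plain arithmetic. The Python sketch below works it through for the 2 femtosecond timestep; the function names are hypothetical and the step rates are illustrative inputs, not results from our testing.

```python
SECONDS_PER_DAY = 86_400
FS_PER_NS = 1_000_000  # 1 nanosecond = 1,000,000 femtoseconds


def ns_per_day(steps_per_second, timestep_fs=2.0):
    """Simulated nanoseconds per wall-clock day at a given step rate."""
    simulated_fs_per_second = steps_per_second * timestep_fs
    return simulated_fs_per_second * SECONDS_PER_DAY / FS_PER_NS


def steps_per_second_for(target_ns_per_day, timestep_fs=2.0):
    """Step rate needed to sustain a target ns/day figure."""
    return target_ns_per_day * FS_PER_NS / (SECONDS_PER_DAY * timestep_fs)


if __name__ == "__main__":
    # Illustrative inputs only.
    print(f"100 steps/s -> {ns_per_day(100):.2f} ns/day")
    print(f"10 ns/day needs {steps_per_second_for(10):.2f} steps/s")
```

With 2 fs timesteps, sustaining 10 ns/day works out to roughly 58 timesteps per second.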

This test also limits itself to 64 threads.

(2-5) NAMD ApoA1 Simulation

At the 64 thread limit, the 3995WX has a good +20% performance gain over the standard TR 3990X, although AMD claims a good 10 ns/day when the chip can process to its fullest.

AI Benchmark 0.1.2 using TensorFlow: Link

Finding an appropriate artificial intelligence benchmark for Windows has been a holy grail of mine for quite a while. The problem is that AI is such a fast moving, fast paced world that whatever I compute this quarter will no longer be relevant in the next, and one of the key metrics in this benchmarking suite is being able to keep data over a long period of time. We’ve had AI benchmarks on smartphones for a while, given that smartphones are a better target for AI workloads, but it also makes some sense that everything on PC is geared towards Linux as well.

Thankfully however, the good folks over at ETH Zurich in Switzerland have converted their smartphone AI benchmark into something that’s usable in Windows. It uses TensorFlow, and for our benchmark purposes we’ve locked our testing down to TensorFlow 2.1.0, AI Benchmark 0.1.2, while using Python 3.7.6.

The benchmark runs through 19 different networks including MobileNet-V2, ResNet-V2, VGG-19 Super-Res, NVIDIA-SPADE, PSPNet, DeepLab, Pixel-RNN, and GNMT-Translation. All the tests probe both the inference and the training at various input sizes and batch sizes, except the translation that only does inference. It measures the time taken to do a given amount of work, and spits out a value at the end.

There is one big caveat for all of this, however. Speaking with the folks over at ETH, they use Intel’s Math Kernel Libraries (MKL) for Windows, and they’re seeing some incredible drawbacks. I was told that MKL for Windows doesn’t play well with multiple threads, and as a result any Windows results are going to perform a lot worse than Linux results. On top of that, after a given number of threads (~16), MKL kind of gives up and performance drops off quite substantially.

So why test it at all? Firstly, because we need an AI benchmark, and a bad one is still better than not having one at all. Secondly, if MKL on Windows is the problem, then by publicizing the test, it might just put a boot somewhere for MKL to get fixed. To that end, we’ll stay with the benchmark as long as it remains feasible.

(2-6) AI Benchmark 0.1.2 Total

CPU Tests: Simulation

Simulation and Science have a lot of overlap in the benchmarking world, however for this distinction we’re separating into two segments mostly based on the utility of the resulting data. The benchmarks that fall under Science have a distinct use for the data they output – in our Simulation section, these act more like synthetics but at some level are still trying to simulate a given environment.

DigiCortex v1.35: link

DigiCortex is a pet project for the visualization of neuron and synapse activity in the brain. The software comes with a variety of benchmark modes, and we take the small benchmark which runs a 32k neuron/1.8B synapse simulation, similar to a small slug.

The results on the output are given as a fraction of whether the system can simulate in real-time, so anything above a value of one is suitable for real-time work. The benchmark offers a 'no firing synapse' mode, which in essence detects DRAM and bus speed, however we take the firing mode which adds CPU work with every firing.

The software originally shipped with a benchmark that recorded the first few cycles and output a result. So while on fast multi-threaded processors this made the benchmark last less than a few seconds, slow dual-core processors could be running for almost an hour. There is also the issue of DigiCortex starting with a base neuron/synapse map in ‘off mode’, giving a high result in the first few cycles as none of the nodes are currently active. We found that the performance settles down into a steady state after a while (when the model is actively in use), so we asked the author to allow for a ‘warm-up’ phase and for the benchmark to be the average over a second sample time.

For our test, we give the benchmark 20000 cycles to warm up and then take the data over the next 10000 cycles for the test – on a modern processor this takes 30 seconds and 150 seconds respectively. This is then repeated a minimum of 10 times, with the first three results rejected. Results are shown as a multiple of real-time calculation.

(3-1) DigiCortex 1.35 (32k Neuron, 1.8B Synapse)

This test prefers monolithic silicon with proportionally lots of memory bandwidth, which means that we get somewhat of an equalling of results here. The top result in our benchmark database is actually single chiplet Ryzen.

Dwarf Fortress 0.44.12: Link

Another long standing request for our benchmark suite has been Dwarf Fortress, a popular management/roguelike indie video game, first launched in 2006 and still being regularly updated today, aiming for a Steam launch sometime in the future.

Emulating the ASCII interfaces of old, this title is a rather complex beast, which can generate environments subject to millennia of rule, famous faces, peasants, and key historical figures and events. The further you get into the game, depending on the size of the world, the slower it becomes as it has to simulate more famous people, more world events, and the natural way that humanoid creatures take over an environment. Like some kind of virus.

For our test we’re using DFMark. DFMark is a benchmark built by vorsgren on the Bay12Forums that gives two different modes built on DFHack: world generation and embark. These tests can be configured, but range anywhere from 3 minutes to several hours. After analyzing the test, we ended up going for three different world generation sizes:

Small, a 65x65 world with 250 years, 10 civilizations and 4 megabeasts
Medium, a 129x129 world with 550 years, 10 civilizations and 4 megabeasts
Large, a 257x257 world with 550 years, 40 civilizations and 10 megabeasts

DFMark outputs the time to run any given test, so this is what we use for the output. We loop the small test for as many times possible in 10 minutes, the medium test for as many times in 30 minutes, and the large test for as many times in an hour.

(3-2a) Dwarf Fortress 0.44.12 World Gen 65x65, 250 Yr (3-2b) Dwarf Fortress 0.44.12 World Gen 129x129, 550 Yr (3-2c) Dwarf Fortress 0.44.12 World Gen 257x257, 550 Yr

Dwarf Fortress is mainly single-thread limited, hence the 64-core models at the back end of the queue. The TR parts are still a good bit faster than the EPYC.

Dolphin v5.0 Emulation: Link

Many emulators are often bound by single thread CPU performance, and general reports tended to suggest that Haswell provided a significant boost to emulator performance. This benchmark runs a Wii program that ray traces a complex 3D scene inside the Dolphin Wii emulator. Performance on this benchmark is a good proxy of the speed of Dolphin CPU emulation, which is an intensive single core task using most aspects of a CPU. Results are given in seconds, where the Wii itself scores 1051 seconds.

(3-3) Dolphin 5.0 Render Test

Similarly here, single thread performance matters.

Conclusions: Faster Than Expected

When I started testing for this review, looking purely at the specification sheet, I was expecting AMD’s Threadripper Pro 3995WX to come in just behind the 3990X in most of our testing. The same number of cores, the same TDP, but slightly lower on frequencies in exchange for double the memory channels and 8x the memory support (also Pro features). More often than not our processor comparisons test systems with identical memory systems, or we don’t consider that memory difference that major in most of our testing. After going through the end data for this review, it would appear that it makes more of a difference than we had initially thought.

In the tests that matter, most noticeably the 3D rendering tests, we’re seeing a 3% speed-up on the Threadripper Pro compared to the regular Threadripper at the same memory frequency and sub-timings. The core frequencies were preferential on the 3990X, but the memory bandwidth of the 3995WX is obviously helping to a small degree, enough to pull ahead in our testing, along with the benefit of having access to 8x of the memory capacity as well as Pro features for proper enterprise-level administration.

The downside of this comparison is the cost: the SEP difference is +$1500, or another 50%, for the Threadripper Pro 3995WX over the regular Threadripper 3990X. With this price increase, you’re not really paying +50% for the performance difference (ECC memory also costs a good amount), but the feature set. Threadripper Pro is aimed at the visual effects and rendering market, where holding 3D models in main memory is a key aspect of workflow speed as well as full-scene production. Alongside the memory capacity difference, having double the PCIe 4.0 lanes means more access to offload hardware or additional fast storage, also important tools in the visual effects space. Threadripper Pro falls very much into the bucket of 'if you need it, this is the option to go for'.

For our testing, we used the Lenovo Thinkstation P620, the first Threadripper Pro system available in the market, and we’ll have a full review on it shortly. The Thinkstation Pro systems are always well designed workstations with longevity and professional workloads in mind, enabling 280 W cooling with a fun heatsink but also additional custom DRAM fans, a unique motherboard with an easily removable power supply, and support and space for a number of add-in cards. Lenovo’s units, if you buy them individually from the website, are eye-wateringly expensive (+$12200 for the 64-core CPU, a +120% markup), and it is recommended that any design studio that wants to test or order these units should work through a local distributor.

AMD is set to push Threadripper Pro into the consumer and commercial markets beyond Lenovo later this quarter. We have already been in touch with regional system integrators who are examining their options based on the three Threadripper Pro motherboards set to be available in the market from ASUS, GIGABYTE, and Supermicro. We are expecting a range of options to be available, and most design studios are likely to order pre-built systems with a variety of air and liquid cooling.

What might confuse a few users is that AMD is launching Threadripper Pro into the major market now, right on the cusp of its next-generation EPYC launch in the next eight weeks. These new EPYC processors should afford a sizeable raw compute upgrade moving to Zen 3 cores, all while Threadripper Pro is on Zen 2. As we saw comparing TR Pro to EPYC in this review, both on Zen 2, in some circumstances it is the push up to 280 W where TR Pro gets the best performance, and a 280 W version of next-generation EPYC might seem more appealing to users looking at TR Pro today. What exactly AMD will launch for EPYC is unknown, whereas TR Pro on this generation is now a known performance factor that system integrators are building on for the workstation market. EPYC never really fit into the workstation market that easily, which is why TR Pro exists today.

We have heard some conflicting dates as to when exactly Threadripper Pro will come to the mass market beyond Lenovo, but they all fall within Q1. We have reached out to AMD in order to source the other processors for our testing.