Original Link: https://www.anandtech.com/show/16778/amd-epyc-milan-review-part-2
AMD EPYC Milan Review Part 2: Testing 8 to 64 Cores in a Production Platform
by Andrei Frumusanu on June 25, 2021 9:30 AM EST

It’s been a few months since AMD first announced its new third generation EPYC Milan server CPU line-up. We had initially reviewed the first SKUs back in March, covering the core density optimised 64-core EPYC 7763 and EPYC 7713, and the core-performance optimised 32-core EPYC 75F3. Since then, we’ve been able to get our hands on several new mid- and lower-end SKUs in the form of the new 24-core EPYC 7443, the 16-core EPYC 7343, as well as the very curious 8-core EPYC 72F3, which we’ll be reviewing today.
What’s also changed since our initial review back in March is the release of Intel’s newer 3rd generation Xeon Scalable processors (Ice Lake SP), with our review of the 40-core Xeon 8380 and the 28-core Xeon 6330.
Today’s review will be focused on the new performance numbers of AMD’s EPYC CPUs; for a more comprehensive platform and architecture overview, I highly recommend reading our respective initial reviews, which go into more detail on the current server CPU landscape:
- (April 6th) Intel 3rd Gen Xeon Scalable (Ice Lake SP) Review: Generationally Big, Competitively Small
- (March 15th) AMD 3rd Gen EPYC Milan Review: A Peak vs Per Core Performance Balance
- (December 18th) The Ampere Altra Review: 2x 80 Cores Arm Server Performance Monster
What's New: EPYC 7443, 7343, 72F3 Low Core Count SKUs
In terms of new SKUs that we’re testing today, as mentioned, we’ll be looking at AMD’s new EPYC 7443 and 7343 as well as the 72F3, mid- to low core-count SKUs that come at much more affordable price tags compared to the flagship units we had initially reviewed back in March. As part of the new platform switch, which we’ll cover in a bit, we’re also re-reviewing the 64-core EPYC 7763 and the 32-core EPYC 75F3 – resulting in a few surprises, and resolving some of the issues we had identified with 3rd generation Milan in our first review.
AMD EPYC 7003 Processors: Core Performance Optimized

| SKU | Cores / Threads | Base Freq (MHz) | Turbo Freq (MHz) | L3 Cache | TDP | Price |
|---|---|---|---|---|---|---|
| **F-Series** | | | | | | |
| EPYC 75F3 | 32 / 64 | 2950 | 4000 | 256 MB | 280 W | $4860 |
| EPYC 74F3 | 24 / 48 | 3200 | 4000 | 256 MB | 240 W | $2900 |
| EPYC 73F3 | 16 / 32 | 3500 | 4000 | 256 MB | 240 W | $3521 |
| EPYC 72F3 | 8 / 16 | 3700 | 4100 | 256 MB | 180 W | $2468 |
Starting off with probably the weirdest CPU in AMD’s EPYC 7003 line-up, the new 72F3 is quite the speciality part: an 8-core server CPU that still features the maximum available platform capabilities as well as the full 256MB of L3 cache. AMD achieves this by populating the part with 8 chiplet dies, each with its full 32MB of L3 cache, but with only one core enabled per die. This enables a (for a server part) relatively high base frequency of 3.7GHz, boosting up to 4.1GHz, and landing at a TDP of 180W, with the part costing $2468.
The unit is quite an extreme case of SKU segmentation, and focuses on deployments where per-core performance is paramount, or use-cases where per-core software licences vastly outweigh the cost of the actual hardware. We’re also re-reviewing the 32-core 75F3 in this core-performance optimised family, which goes for a much higher 280W TDP.
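As a side note, this one-core-per-chiplet topology is something you can sanity-check from a running Linux system; a minimal sketch (column support and output formatting vary with the util-linux version):

# Map each logical CPU to its core, socket, NUMA node and cache IDs;
# on a 72F3, each L3 instance should be associated with only one physical core.
lscpu --extended=CPU,CORE,SOCKET,NODE,CACHE
# The package should still report the full 256MB of L3 in total:
lscpu | grep -i "L3"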
AMD EPYC 7003 Processors: Core Density Optimized

| SKU | Cores / Threads | Base Freq (MHz) | Turbo Freq (MHz) | L3 Cache | TDP | Price |
|---|---|---|---|---|---|---|
| EPYC 7763 | 64 / 128 | 2450 | 3500 | 256 MB | 280 W | $7890 |
| EPYC 7713 | 64 / 128 | 2000 | 3675 | 256 MB | 225 W | $7060 |
| EPYC 7663 | 56 / 112 | 2000 | 3500 | 256 MB | 240 W | $6366 |
| EPYC 7643 | 48 / 96 | 2300 | 3600 | 256 MB | 225 W | $4995 |
| **P-Series (Single Socket Only)** | | | | | | |
| EPYC 7713P | 64 / 128 | 2000 | 3675 | 256 MB | 225 W | $5010 |
In the core-density optimised series, we’re continuing to use the 64-core EPYC 7763 flagship SKU, which lands in at a 280W TDP and a high cost of $7890 MSRP. Unfortunately, we no longer have access to the EPYC 7713, so we couldn’t re-review this part; benchmark numbers from this SKU in this review carry forward our older scores, and are labelled as such in our graphs.
AMD EPYC 7003 Processors

| SKU | Cores / Threads | Base Freq (MHz) | Turbo Freq (MHz) | L3 Cache | TDP | Price |
|---|---|---|---|---|---|---|
| EPYC 7543 | 32 / 64 | 2800 | 3700 | 256 MB | 225 W | $3761 |
| EPYC 7513 | 32 / 64 | 2600 | 3650 | 128 MB | 200 W | $2840 |
| EPYC 7453 | 28 / 56 | 2750 | 3450 | 64 MB | 225 W | $1570 |
| EPYC 7443 | 24 / 48 | 2850 | 4000 | 128 MB | 200 W | $2010 |
| EPYC 7413 | 24 / 48 | 2650 | 3600 | 128 MB | 180 W | $1825 |
| EPYC 7343 | 16 / 32 | 3200 | 3900 | 128 MB | 190 W | $1565 |
| EPYC 7313 | 16 / 32 | 3000 | 3700 | 128 MB | 155 W | $1083 |
| **P-Series (Single Socket Only)** | | | | | | |
| EPYC 7543P | 32 / 64 | 2800 | 3700 | 256 MB | 225 W | $2730 |
| EPYC 7443P | 24 / 48 | 2850 | 4000 | 128 MB | 200 W | $1337 |
| EPYC 7313P | 16 / 32 | 3000 | 3700 | 128 MB | 155 W | $913 |
Finally, the most interesting parts of today’s evaluation are AMD’s mid- to low core count EPYC 7443 and EPYC 7343 CPUs. At 24 and 16 cores, the chips feature a fraction of the platform’s maximum theoretical core count, but also come at much more affordable price points. These parts should be especially interesting for deployments that plan on using the platform’s full memory or I/O capabilities, but don’t require the raw processing power of the higher-end parts.
These two parts are also defined by having only 128MB of L3 cache, meaning the chips run only 4 active chiplets, with respectively 6 and 4 cores per chiplet enabled. The TDPs are also more reasonable at 200W and 190W, with correspondingly lower pricing of $2010 and $1565.
Following Intel’s 3rd generation Xeon Ice Lake SP launch and our testing of the 28-core Xeon 6330, which lands in at an MSRP of $1894, it’s here where we’ll be seeing the most interesting performance and value comparison in today’s review.
Test Platform Change - Production Milan Board from GIGABYTE: MZ72-HB0 (rev. 3.0)
In our initial Milan review, we unfortunately had to work with AMD to remotely test the newest Milan parts within the company’s local datacentre, as our own Daytona reference server platform had encountered an unrecoverable hardware failure.
In general, if possible, we also prefer to test things on production systems as they represent a more mature and representative firmware stack.
A few weeks ago, at Computex, GIGABYTE had revealed their newest revision of the company’s dual-socket EPYC board, the E-ATX MZ72-HB0 rev.3.0, which now comes with out-of-box support for the newest 3rd generation Milan parts (The prior rev.1.0 boards don’t support the new CPUs).
The E-ATX form-factor allows for more flexible test-bench setups and noiseless operation (thanks to Noctua’s massive NH-U14S TR4-SP3 coolers) in more conventional workstation environments.
The platform change away from AMD’s Daytona reference server to the GIGABYTE system also has some significant impacts on the 3rd generation Milan SKUs’ performance: the chips behave notably differently in terms of power characteristics than what we saw on AMD’s system, allowing them to achieve even higher performance than what we had tested and published in our initial review.
Test Bed and Setup - Compiler Options
For the rest of our performance testing, we’re disclosing the details of the various test setups:
AMD - Dual EPYC 7763 / 75F3 / 7443 / 7343 / 72F3
For today’s review and its new performance figures, we’re using GIGABYTE’s new MZ72-HB0 rev.3.0 board as the primary test platform for the EPYC 7763, 75F3, 7443, 7343 and 72F3. The system is running at full default settings, meaning performance or power determinism as configured by AMD in their default SKU fuse settings.
| Component | Configuration |
|---|---|
| CPU | 2x AMD EPYC 7763 (2.45-3.50 GHz, 64c, 256 MB L3, 280W) / 2x AMD EPYC 75F3 (2.95-4.00 GHz, 32c, 256 MB L3, 280W) / 2x AMD EPYC 7443 (2.85-4.00 GHz, 24c, 128 MB L3, 200W) / 2x AMD EPYC 7343 (3.20-3.90 GHz, 16c, 128 MB L3, 190W) / 2x AMD EPYC 72F3 (3.70-4.10 GHz, 8c, 256 MB L3, 180W) |
| RAM | 512 GB (16x32 GB) Micron DDR4-3200 |
| Internal Disks | Crucial MX300 1TB |
| Motherboard | GIGABYTE MZ72-HB0 (rev. 3.0) |
| PSU | EVGA 1600 T2 (1600W) |
Software-wise, we ran Ubuntu 20.10 images with the latest release 5.11 Linux kernel. Performance settings both in the OS as well as in the BIOS were left at their defaults, including things such as the regular schedutil-based frequency governor, and the CPUs running in performance determinism mode at their respective default TDPs unless otherwise indicated.
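For readers replicating the setup, the active frequency governor is quick to confirm through sysfs; a minimal sketch:

# Each logical CPU should report "schedutil" on our default Ubuntu 20.10 images
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c
# Driver and alternative governors for CPU 0
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors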
AMD - Dual EPYC 7713 / 7662
Due to not having access to the 7713 for this review, we’re picking up the chip’s older test numbers from AMD’s Daytona platform. We had also tested the Rome EPYC 7662 on that system – the latter didn’t exhibit any issues in terms of its power behaviour.
| Component | Configuration |
|---|---|
| CPU | 2x AMD EPYC 7713 (2.00-3.675 GHz, 64c, 256 MB L3, 225W) / 2x AMD EPYC 7662 (2.00-3.30 GHz, 64c, 256 MB L3, 225W) |
| RAM | 512 GB (16x32 GB) Micron DDR4-3200 |
| Internal Disks | Varying |
| Motherboard | Daytona reference board: S5BQ |
| PSU | PWS-1200 |
AMD - Dual EPYC 7742
Our local AMD EPYC 7742 system, due to the aforementioned issues with the Daytona hardware, is running on a SuperMicro H11DSI Rev 2.0.
| Component | Configuration |
|---|---|
| CPU | 2x AMD EPYC 7742 (2.25-3.40 GHz, 64c, 256 MB L3, 225W) |
| RAM | 512 GB (16x32 GB) Micron DDR4-3200 |
| Internal Disks | Crucial MX300 1TB |
| Motherboard | SuperMicro H11DSI (Rev 2.0) |
| PSU | EVGA 1600 T2 (1600W) |
As an operating system we’re using Ubuntu 20.10 with no further optimisations. In terms of BIOS settings we’re using complete defaults, including retaining the default 225W TDP of the EPYC 7742s, as well as leaving further CPU configurables on auto, except for NPS settings, where we explicitly state the configuration in the results.
The system has all relevant security mitigations activated against speculative store bypass and Spectre variants.
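Both the NPS configuration and the mitigation status can be double-checked at runtime; a minimal sketch:

# Under NPS4, a dual-socket EPYC system should enumerate 8 NUMA nodes
numactl --hardware
# Kernel-reported status for each known speculative-execution vulnerability
grep . /sys/devices/system/cpu/vulnerabilities/*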
Intel - Dual Xeon Platinum 8380
For our new Ice Lake test system, based on the Whitley platform, we’re using Intel’s SDP (Software Development Platform) S2W3SIL4Q, featuring a 2-socket Intel server board (Coyote Pass).
The system is an airflow optimised 2U rack unit with otherwise little fanfare.
Our review setup solely includes the new Intel Xeon 8380 with 40 cores, a 2.3GHz base clock, 3.0GHz all-core boost, and 3.4GHz peak single-core boost. What’s unusual about this part, as noted in the intro, is that it’s running at a default 270W TDP, which is above what we’ve seen from previous generation non-specialised Intel SKUs.
| Component | Configuration |
|---|---|
| CPU | 2x Intel Xeon Platinum 8380 (2.3-3.4 GHz, 40c, 60 MB L3, 270W) |
| RAM | 512 GB (16x32 GB) SK Hynix DDR4-3200 |
| Internal Disks | Intel SSD P5510 7.68TB |
| Motherboard | Intel Coyote Pass (Server System S2W3SIL4Q) |
| PSU | 2x Platinum 2100W |
The system came with several SSDs, including Optane SSD P5800X units; however, we ran our test suite on the P5510 – not that our current benchmarks are I/O bound anyhow.
As per Intel guidance, we’re using the latest BIOS available with the 270 release microcode update.
Intel - Dual Xeon Platinum 8280
For the older Cascade Lake Intel system we’re also using a test-bench setup with the same SSD and OS image as on the EPYC 7742 system.
Because the Xeons only have 6-channel memory, their maximum capacity is limited to 384GB of the same Micron memory, running at a default 2933MHz to remain in-spec with the processor’s capabilities.
| Component | Configuration |
|---|---|
| CPU | 2x Intel Xeon Platinum 8280 (2.7-4.0 GHz, 28c, 38.5 MB L3, 205W) |
| RAM | 384 GB (12x32 GB) Micron DDR4-3200 (running at 2933MHz) |
| Internal Disks | Crucial MX300 1TB |
| Motherboard | ASRock EP2C621D12 WS |
| PSU | EVGA 1600 T2 (1600W) |
The Xeon system was similarly run on BIOS defaults on an ASRock EP2C621D12 WS with the latest firmware available.
Ampere "Mount Jade" - Dual Altra Q80-33
For the Ampere Altra system, we’re using the Mount Jade server as configured and provided by Ampere. The system features two Altra Q80-33 processors on Ampere’s Mount Jade DVT motherboard.
In terms of memory, we’re using the bundled 16 DIMMs of 32GB Samsung DDR4-3200, for a total of 512GB, or 256GB per socket.
| Component | Configuration |
|---|---|
| CPU | 2x Ampere Altra Q80-33 (3.3 GHz, 80c, 32 MB L3, 250W) |
| RAM | 512 GB (16x32 GB) Samsung DDR4-3200 |
| Internal Disks | Samsung MZ-QLB960NE 960GB, Samsung MZ-1LB960NE 960GB |
| Motherboard | Mount Jade DVT Reference Motherboard |
| PSU | 2000W (94%) |
The system came preinstalled with CentOS 8, and we continued using that OS. It’s to be noted that the server is naturally Arm SBSA compatible, and can thus run any kind of Linux distribution.
The only other note to make about the system is that the OS is running with 64KB pages rather than the usual 4KB pages. This can either be seen as a testing discrepancy or as an advantage on the part of the Arm system, given that the next page size step for x86 systems is 2MB – which isn’t feasible for general use-case testing, and is something deployments would have to explicitly enable.
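The base page size is trivial to confirm on any of the systems; a one-liner sketch:

# Prints 65536 on the Altra's 64KB-page CentOS image, 4096 on the x86 systems
getconf PAGESIZE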
The system has all relevant security mitigations activated, including SSBS (Speculative Store Bypass Safe) against Spectre variants.
Compiler Setup
For compiled tests, we’re using the release version of GCC 10.2. The toolchain was compiled from scratch on both the x86 systems as well as the Altra system. We’re using shared binaries with the system’s libc libraries.
AMD Platform vs GIGABYTE: IO Power Overhead Gone
Starting off with the big change for today’s review: the new production-grade, Milan-compatible GIGABYTE test platform.
In our original review of Milan, we had discovered that AMD’s newest generation chips had one large glass jaw: the platform’s extremely high idle package power, exceeding 100W. This was a notable regression compared to what we saw on Rome, and we deemed it a core cause of why Milan was seeing performance regressions in certain workloads compared to the predecessor Rome SKUs.
We had communicated our findings and worries to AMD prior to the review’s publication, but never root-caused the issue, and were never able to confirm whether this was the intended behaviour of the new Milan chips or not. We theorised that it was a side-effect of the new sIOD, which has the Infinity Fabric running at a higher frequency, as this generation runs it in 1:1 mode with the memory controller clocks.
To our surprise, when setting up the new GIGABYTE system, we found out that this behaviour of extremely high idle power was not being exhibited on the new test platform.
Indeed, instead of the 100W idle figures we had measured on the Daytona system, we’re now seeing figures pretty much in line with AMD’s Rome system, at around 65-72W. The biggest discrepancy was found on the 75F3 part, which now idles 39W lower than on the Daytona system.
Milan Power Efficiency: EPYC 7763 (Milan), 280W TDP setting, Daytona vs GIGABYTE platform

| Workload | Daytona Perf | Daytona PKG (W) | Daytona Core (W) | GIGABYTE Perf | GIGABYTE PKG (W) | GIGABYTE Core (W) |
|---|---|---|---|---|---|---|
| 500.perlbench_r | 281 | 274 | 166 | 317 | 282 | 195 |
| 502.gcc_r | 262 | 262 | 131 | 271 | 265 | 150 |
| 505.mcf_r | 155 | 252 | 115 | 158 | 252 | 132 |
| 520.omnetpp_r | 142 | 249 | 120 | 144 | 244 | 133 |
| 523.xalancbmk_r | 181 | 261 | 131 | 195 | 266 | 152 |
| 525.x264_r | 602 | 279 | 172 | 641 | 283 | 196 |
| 531.deepsjeng_r | 262 | 267 | 161 | 296 | 283 | 196 |
| 541.leela_r | 267 | 249 | 148 | 303 | 274 | 199 |
| 548.exchange2_r | 487 | 274 | 176 | 543 | 262 | 202 |
| 557.xz_r | 190 | 260 | 141 | 206 | 272 | 171 |
| **SPECint2017** | 255 | 260 | 141 | 275 | 265 | 164 |
| kJ Total | 2029 | | | 1932 | | |
| Score / W | 0.980 | | | 1.037 | | |
| 503.bwaves_r | 354 | 226 | 90 | 362 | 218 | 99 |
| 507.cactuBSSN_r | 222 | 278 | 150 | 229 | 285 | 174 |
| 508.namd_r | 282 | 279 | 176 | 280 | 260 | 193 |
| 510.parest_r | 153 | 256 | 119 | 162 | 259 | 138 |
| 511.povray_r | 348 | 275 | 176 | 387 | 255 | 193 |
| 519.lbm_r | 39 | 219 | 84 | 40 | 210 | 92 |
| 526.blender_r | 372 | 276 | 165 | 396 | 282 | 188 |
| 527.cam4_r | 399 | 278 | 147 | 417 | 285 | 170 |
| 538.imagick_r | 446 | 278 | 178 | 471 | 268 | 200 |
| 544.nab_r | 259 | 278 | 175 | 275 | 282 | 198 |
| 549.fotonik3d_r | 110 | 220 | 86 | 113 | 215 | 95 |
| 554.roms_r | 88 | 243 | 106 | 89 | 241 | 119 |
| **SPECfp2017** | 211 | 240 | 110 | 220 | 235 | 123 |
| kJ Total | 4980 | | | 4716 | | |
| Score / W | 0.879 | | | 0.936 | | |
A more detailed power analysis of the EPYC 7763 during our SPEC2017 runs confirms the change in power behaviour. The total average package power hasn’t changed much between the systems – in the integer suite it is now 5W higher at 265W vs 260W, and in the FP suite 5W lower at 235W vs 240W – but what changes far more significantly is the core power allocation, which is now much higher on the GIGABYTE system.
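As a sanity check, the Score / W rows in the table above work out to the estimated SPEC scores divided by the average package power:

SPECint2017: 255 / 260 W = 0.980 (Daytona) vs 275 / 265 W = 1.037 (GIGABYTE)
SPECfp2017: 211 / 240 W = 0.879 (Daytona) vs 220 / 235 W = 0.936 (GIGABYTE)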
In core-bound workloads with little memory pressure, such as 541.leela_r, the core power of the EPYC 7763 went up from 148W to 199W, a +51W increase or +34%. Naturally because of this core power increase, there’s also a corresponding large performance increase of +13.3%.
The behaviour change doesn’t apply to every workload: memory-heavy workloads such as 519.lbm don’t see much of a change in power behaviour, and only showcase a small performance boost.
Reviewing the performance differences between the original Daytona system tested figures and the new GIGABYTE motherboard test-runs, we’re seeing some significant performance boosts across the board, with many 10-13% increases in compute bound and core-power bound workloads.
These figures are significant enough that they change the overall verdict on those SKUs, and they also change the tone of our final review verdict on Milan: evidently, the one weakness of the new generation was not a design mishap, but rather an issue with the Daytona system. It explains a lot of the more lacklustre performance increases of Milan vs Rome, and we’re happy that this is ultimately not an issue for production-grade platforms.
As a note, because we now also have the 4-chiplet EPYC 7443 and EPYC 7343 SKUs in-house, we also measured the platform idle power of those units, which came in at 50W and 52W. This is actually quite a bit below the 65-75W of the 8-chiplet 7763, 75F3 and 72F3 parts, which indicates that this power behaviour isn’t solely internal to the sIOD chiplet, but is rather part of the sIOD-to-CCD interfaces, or the CCDs’ L3 cache power.
SPEC - Multi-Threaded Performance - Subscores
Picking up from the power efficiency discussion, let’s dive directly into the multi-threaded SPEC results. As usual, because these are not officially submitted scores to SPEC, we’re labelling the results as “estimates” as per the SPEC rules and license.
We compile the binaries with GCC 10.2 on their respective platforms, with simple -Ofast optimisation flags and the relevant architecture and machine tuning flags (-march/-mtune=neoverse-n1; -march/-mtune=skylake-avx512; -march/-mtune=znver2, the latter used for Zen3 as well, due to GCC 10.2 not having a znver3 target).
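As a concrete illustration, the per-platform compile invocations look roughly as follows (a sketch; in practice the SPEC harness wraps these flags into its config files):

# EPYC Rome/Milan systems - GCC 10.2 has no znver3 target, so Zen3 also uses znver2
gcc -Ofast -march=znver2 -mtune=znver2 -o workload workload.c
# Ice Lake / Cascade Lake Xeon systems
gcc -Ofast -march=skylake-avx512 -mtune=skylake-avx512 -o workload workload.c
# Ampere Altra (Neoverse N1) system
gcc -Ofast -march=neoverse-n1 -mtune=neoverse-n1 -o workload workload.c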
I’ll be going over two chart comparisons: first, the respective flagship parts, consisting of the new EPYC 7763 numbers pitted against Intel’s 40-core Xeon Ice Lake SP and Ampere’s Altra Q80-33, along with the figures we have for AMD’s EPYC 7742. It’s to be noted that the latter is a 225W part, compared to the 280W 7763.
In SPECint2017, the EPYC 7763 extends its lead over Intel’s current best CPU, improving the numbers beyond what we had originally published in our April review. While AMD also further narrows the gap to Ampere’s 80-core Altra SKU, there are still many core-bound workloads that notably favour the Neoverse N1 part, given its 25% advantage in core count.
Also in SPECint2017, but this time focusing on the mid-tier SKUs, the most interesting comparison points here are the new 24-core EPYC 7443 and 16-core EPYC 7343 against the new 28-core Xeon 6330. What’s shocking is that Intel’s new Ice Lake SP server chip not only has trouble competing against AMD’s 24-core chip, but even struggles to differentiate itself from AMD’s 16-core chip.
The 72F3 8-core part is interesting, but generally we have trouble placing such a SKU competitively, given that we don’t have a comparable part from the competition to pit against it.
For the high-end SKUs, we again see the 7763 improve its performance positioning compared to what we had reviewed a few months ago, although with fewer large performance boost outliers, due to the memory-heavy nature of the floating-point test suite.
In the low-end SKUs, we see a similar story as in the integer suite, where AMD’s 16-core 7343 battles it out against Intel’s 28-core Xeon, while the 24-core unit is comfortably a good margin ahead of the competition.
The 72F3 showcases some interesting scores here: because more of these workloads are fundamentally memory bound, the actual core count deficit of this SKU doesn’t really hamper its performance compared to its siblings. If anything, the lower core count has some positive side-effects, as it results in less cache and DRAM contention, meaning less overhead and actually higher performance than the higher core count parts. Theoretically, you could mimic this on the higher core count parts by simply running fewer workload instances and threads, but if a deployment runs workloads with such performance characteristics, the low core count 72F3 could make sense.
SPEC - Multi-Threaded Performance - Aggregate
Due to the number of SKUs, I’m splitting up the aggregate scores on its own page:
In SPECint2017, for the high-end SKUs, we see the 7763 improve its 2S score by 4.89%, while the 1S score increases by 7.9%. The 75F3 also similarly sees a +3.9% 2S increase, and a 5.4% 1S increase, both extending their leadership positions.
The low core count SKUs, while naturally quite a bit lower in absolute performance, are still extremely impressive in the competitive landscape, with the 16-core 7343 again having no issues keeping up with the new 28-core Xeon 6330 or the last-gen flagship Xeon 8280.
In the floating-point suite, more representative of HPC workloads, we also see the 7763 increase its scores by +1.3% in 2S and +4.2% in 1S configurations, while the 75F3 improves its positioning by +4.2% and +4.8%. The increases don’t change the overall landscape much, but they allow the parts to rank better than initially reviewed.
Again, the 7443 and 7343 are excellent performers here given they outperform the Xeon 6330 and Xeon 8280.
SPEC - Single-Threaded Performance
Single-thread performance of server CPUs usually isn’t the most important metric for most scale-out workloads, but there are use-cases such as EDA tools which are pretty much single-thread performance bound.
Power envelopes usually don’t matter much here; the performance factors that actually come into play are simply the boost clocks of the CPUs, their IPC, and the memory latency of the cores.
What’s interesting in the results is the single-core performance of the 7443 and 7343 tracking extremely close to the frequency-optimised 72F3 and 75F3 parts. Indeed, the actual boost frequencies between the SKUs are quite similar, so there really wasn’t any particular reason why the lower core count “regular” units wouldn’t perform similarly to their “core-performance” optimised siblings.
It’s again notable that, unlike with Intel’s SKUs, going for lower core count parts in AMD’s EPYC 7003 series actually results in better single-threaded performance, with higher boost frequencies than the large core count parts.
SPEC - Per-Core Performance under Load
A metric that is more interesting than isolated single-thread performance is per-thread performance in a fully loaded system. This is a measurement that would greatly interest enterprises and customers running software that is licensed on a per-core basis, or workloads that require a certain per-thread service level agreement in terms of performance.
It’s also here where AMD’s new low core count SKUs are extremely interesting, and really get to distinguish themselves:
Starting off with the EPYC 7343 and the 7443, what’s really interesting to see here is that both keep up well with the more expensive 75F3 SKU in terms of per-thread performance. The EPYC 7343 actually outperforms the 7443 in the FP test suite, because it has 33% fewer cores sharing the L3 cache and DRAM resources (or, seen the other way, the 7443 has 50% more cores contending for them).
The 72F3 also showcases its extreme positioning in the SKU stack here: having the full 32MB of L3 dedicated to a single core, and the full 8-channel DRAM resources shared amongst only 8 cores, results in outstandingly good per-thread performance. When running 2 threads per core, the chip actually still outperforms the per-thread performance of the higher density core count SKUs running only 1 thread per core.
A good visualisation of the socket throughput versus per-thread performance metrics is plotting the various datapoints in a chart on those two axes:
For today’s review, the 7763 and 75F3 move further to the right and higher than they were before, while the new 7443 and 7343 showcase a quite stark competitive situation against Intel’s Xeon 6330.
The Xeon 6330 costs $1894, while the 7443 and 7343 respectively land in at $2010 and $1565. In terms of socket throughput, the Intel chip roughly matches the 16-core AMD counterpart, while the AMD chip showcases 48-75% better performance per thread. The 24-core 7443 showcases 26-32% more socket performance while at the same time having 47-54% better per-thread performance, despite only being priced 6% higher. It seems clear which designs provide the better value.
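The pricing arithmetic behind that conclusion is straightforward: $2010 / $1894 ≈ 1.06, so the 7443 costs roughly 6% more for 26-32% more throughput, while the 7343 at $1565 / $1894 ≈ 0.83 comes in roughly 17% cheaper at broadly matching throughput.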
SPECjbb MultiJVM - Java Performance
Moving on from SPEC CPU, we shift over to SPECjbb2015. SPECjbb is a benchmark developed from the ground up that aims to cover both Java performance and server-like workloads. From the SPEC website:
“The SPECjbb2015 benchmark is based on the usage model of a worldwide supermarket company with an IT infrastructure that handles a mix of point-of-sale requests, online purchases, and data-mining operations. It exercises Java 7 and higher features, using the latest data formats (XML), communication using compression, and secure messaging.
Performance metrics are provided for both pure throughput and critical throughput under service-level agreements (SLAs), with response times ranging from 10 to 100 milliseconds.”
The important thing to note here is that the workload is of a transactional nature that mostly works on the data-plane, between different Java virtual machines, and thus threads.
We’re using the MultiJVM test method, where all the benchmark components, meaning the controller, server and client virtual machines, run on the same physical machine.
The JVM runtime we’re using is OpenJDK 15 on both x86 and Arm platforms, although not exactly the same sub-version, but closest we could get:
EPYC & Xeon systems:
openjdk 15 2020-09-15
OpenJDK Runtime Environment (build 15+36-Ubuntu-1)
OpenJDK 64-Bit Server VM (build 15+36-Ubuntu-1, mixed mode, sharing)
Altra system:
openjdk 15.0.1 2020-10-20
OpenJDK Runtime Environment 20.9 (build 15.0.1+9)
OpenJDK 64-Bit Server VM 20.9 (build 15.0.1+9, mixed mode, sharing)
Furthermore, we’re configuring SPECjbb’s runtime settings with the following configurables:
SPEC_OPTS_C="-Dspecjbb.group.count=$GROUP_COUNT -Dspecjbb.txi.pergroup.count=$TI_JVM_COUNT -Dspecjbb.forkjoin.workers=N -Dspecjbb.forkjoin.workers.Tier1=N -Dspecjbb.forkjoin.workers.Tier2=1 -Dspecjbb.forkjoin.workers.Tier3=16"
Where N=160 for 2S Altra test runs, N=80 for 1S Altra test runs, N=112 for 2S Xeon 8280, N=56 for 1S Xeon 8280, and N=128 for 2S and 1S on the EPYC systems. The 75F3 system had the worker count reduced to 64 and 32 for 2S/1S runs, with the 7443, 7343 and 72F3 also keeping the same thread-to-core ratio.
The Xeon 8380 was running at N=140 for 2S Xeon 8380 and N=70 for 1S - the benchmark had been erroring out at higher thread counts.
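To illustrate how these settings compose for one configuration, here’s a sketch for a 2S EPYC 7763 run under NPS4 (the TI_JVM_COUNT value is our assumption of one injector per group; the remaining values mirror the text above):

# One back-end group per NUMA node: 2 sockets x NPS4 = 8 groups
GROUP_COUNT=8
# Assumed: a single transaction injector per group
TI_JVM_COUNT=1
# Fork-join worker count for the 64-core EPYC parts
N=128
SPEC_OPTS_C="-Dspecjbb.group.count=$GROUP_COUNT \
 -Dspecjbb.txi.pergroup.count=$TI_JVM_COUNT \
 -Dspecjbb.forkjoin.workers=$N \
 -Dspecjbb.forkjoin.workers.Tier1=$N \
 -Dspecjbb.forkjoin.workers.Tier2=1 \
 -Dspecjbb.forkjoin.workers.Tier3=16"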
In terms of JVM options, we’re limiting ourselves to bare-bone options to keep things simple and straightforward:
EPYC & Altra systems:
JAVA_OPTS_C="-server -Xms2g -Xmx2g -Xmn1536m -XX:+UseParallelGC "
JAVA_OPTS_TI="-server -Xms2g -Xmx2g -Xmn1536m -XX:+UseParallelGC"
JAVA_OPTS_BE="-server -Xms48g -Xmx48g -Xmn42g -XX:+UseParallelGC -XX:+AlwaysPreTouch"
Xeon Cascade Lake systems:
JAVA_OPTS_C="-server -Xms2g -Xmx2g -Xmn1536m -XX:+UseParallelGC"
JAVA_OPTS_TI="-server -Xms2g -Xmx2g -Xmn1536m -XX:+UseParallelGC"
JAVA_OPTS_BE="-server -Xms172g -Xmx172g -Xmn156g -XX:+UseParallelGC -XX:+AlwaysPreTouch"
Xeon Ice Lake systems (SNC1):
JAVA_OPTS_C="-server -Xms2g -Xmx2g -Xmn1536m -XX:+UseParallelGC"
JAVA_OPTS_TI="-server -Xms2g -Xmx2g -Xmn1536m -XX:+UseParallelGC"
JAVA_OPTS_BE="-server -Xms192g -Xmx192g -Xmn168g -XX:+UseParallelGC -XX:+AlwaysPreTouch"
Xeon Ice Lake systems (SNC2):
JAVA_OPTS_C="-server -Xms2g -Xmx2g -Xmn1536m -XX:+UseParallelGC"
JAVA_OPTS_TI="-server -Xms2g -Xmx2g -Xmn1536m -XX:+UseParallelGC"
JAVA_OPTS_BE="-server -Xms96g -Xmx96g -Xmn84g -XX:+UseParallelGC -XX:+AlwaysPreTouch"
The reason the Xeon CLX system is running a larger back-end heap is that we’re running a single NUMA node per socket, while for the Altra and EPYC we’re running four NUMA nodes per socket for maximised throughput. This means that for the 2S figures we have 8 back-ends running on the Altra and EPYC and 2 on the Xeon, and naturally half those numbers for the 1S benchmarks.
For the Ice Lake system, I ran both SNC1 (one NUMA node) and SNC2 (two nodes), with the corresponding scaling in the back-end memory allocation.
The back-ends and transaction injectors are affinitised to their local NUMA node with numactl --cpunodebind and --membind, while the controller is called with --interleave=all.
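As a sketch of what those launch invocations look like for one back-end and the controller (the group/JVM identifiers and exact jar arguments are illustrative; the real runs are driven by SPECjbb’s launch scripts):

# Back-end of group 1: CPU and memory pinned to NUMA node 0
numactl --cpunodebind=0 --membind=0 \
  java $JAVA_OPTS_BE -jar specjbb2015.jar -m BACKEND -G GRP1 -J JVM1 &
# Controller: memory interleaved across all nodes
numactl --interleave=all \
  java $JAVA_OPTS_C -jar specjbb2015.jar -m MULTICONTROLLER &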
The max-jOPS and critical-jOPS result figures are defined as follows:
"The max-jOPS is the last successful injection rate before the first failing injection rate where the reattempt also fails. For example, if during the RT-curve phase the injection rate of 80000 passes, but the next injection rate of 90000 fails on two successive attempts, then the max-jOPS would be 80000."
"The overall critical-jOPS is computed by taking the geomean of the individual critical-jOPS computed at these five SLA points, namely:
• Critical-jOPSoverall = Geo-mean of (critical-jOPS@ 10ms, 25ms, 50ms, 75ms and 100ms response time SLAs)
During the RT curve building phase the Transaction Injector measures the 99th percentile response times at each step level for all the requests (see section 9) that are considered in the metrics computations. It then computes the Critical-jOPS for each of the above five SLA points using the following formula:
(first * nOver + last * nUnder) / (nOver + nUnder) "
That’s a lot of technicalities to explain an admittedly complex benchmark, but the gist of it is that max-jOPS represents the maximum transaction throughput of a system until further requests fail, and critical-jOPS is an aggregate geomean transaction throughput within several levels of guaranteed response times, essentially different levels of quality of service.
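To make the critical-jOPS computation concrete with made-up numbers: if at the 50ms SLA point the bounding injection rates are first = 80000 and last = 100000 jOPS, with nOver = 3 and nUnder = 1, then critical-jOPS@50ms = (80000 * 3 + 100000 * 1) / (3 + 1) = 85000. The overall critical-jOPS is then the geometric mean of the five such per-SLA values.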
In the max-jOPS metric, the re-tested 7763 increases its throughput by 5%, while the 75F3 oddly enough didn’t see a notable increase.
The 7343 here doesn’t fare quite as well against the Intel competition as in prior tests; AMD’s high core-to-core latency is still a larger bottleneck in such transactional and database-like workloads compared to Intel’s monolithic mesh approach. Only the 7443 manages a slight edge over the 28-core Intel SKU.
In the critical-jOPS metric, both the 16- and 24-core EPYCs lose out to the 28-core Xeon. Unfortunately, we don’t have Xeon 6330 numbers here, due to those chips and that system being in a different location.
Compiling Performance / LLVM
As we’re trying to rebuild our server test suite piece by piece – and there’s still a lot of work ahead to get a good, representative “real world” set of workloads – one of the more highly desired benchmarks amongst readers was a more realistic compilation suite. With the Chrome and LLVM codebases being the most requested, I landed on LLVM, as it’s fairly easy to set up and straightforward.
git clone https://github.com/llvm/llvm-project.git
cd llvm-project
git checkout release/11.x
mkdir ./build
cd ..
mkdir llvm-project-tmpfs
sudo mount -t tmpfs -o size=10G,mode=1777 tmpfs ./llvm-project-tmpfs
cp -r llvm-project/* llvm-project-tmpfs
cd ./llvm-project-tmpfs/build
cmake -G Ninja \
-DLLVM_ENABLE_PROJECTS="clang;libcxx;libcxxabi;lldb;compiler-rt;lld" \
-DCMAKE_BUILD_TYPE=Release ../llvm
time cmake --build .
We’re using the LLVM 11.0.0 release as the build target version, and we’re compiling Clang, libc++, libc++abi, LLDB, compiler-rt and LLD using GCC 10.2 (self-compiled). To avoid any concerns about I/O, we’re building on a ramdisk. We’re measuring the actual build time, and don’t include the configuration phase, as in the real world that usually doesn’t happen repeatedly.
In compiling workloads, the 7763 and 75F3 also saw a 3-4% increase in performance compared to their initial reviews.
The 16-core 7343 ends up as the worst performing chip in this metric, while the 24-core 7443 still managed to put itself well ahead of the 28-core Xeon 6330.
I’ve omitted the 72F3 from the chart, as its results of >17 minutes per socket rescale the chart too much – obviously, compiling is not the use-case for that SKU.
Conclusion & End Remarks
Today’s review article is very much a follow-up piece to our original Milan review back in March. There are two sides to today’s new performance numbers: the improved positioning and fixed power behaviour of the high core count SKUs, as well as the addition of the 16- and 24-core units into the competitive landscape.
Starting off with the platform change: the very odd power behaviour that we initially encountered on AMD’s Daytona platform back in March does not exhibit itself on our new GIGABYTE test platform. This means the high idle power figures of over 100W are gone, and that power budget went back to the actual CPU cores, allowing the EPYC 7763 and EPYC 75F3 we’ve retested today to perform quite a bit better than what we had initially published in our Milan and follow-up Ice Lake SP reviews. The 7763’s socket performance increased by +7.9% in SPECint2017, with many core-bound compute workloads seeing increases of up to +13%.
In general, this is a very welcome resolution to the one thorn in the side we initially encountered with Milan – the platform now represents a straight-up upgrade over Rome in every aspect, without compromises.
The second part of today’s review revolves around the lower core count SKUs in the Milan line-up.
Starting off with the 8-core 72F3: this is admittedly quite the odd part, and won’t be of use to anybody but the most specialised deployments. The one thing that makes the chip stand out is its excellent per-thread performance. Its use-cases are those where per-core licences play a large role in the total cost of ownership, and there the chip should fill its role well.
The 24-core EPYC 7443 and 16-core EPYC 7343 are the more interesting parts in the mid-stack, given their excellent performance. Naturally, socket performance is lower than with the higher core count SKUs, but it scales down sub-linearly.
The most interesting comparisons today pitted the 24- and 16-core Milan parts against Intel’s newest 28-core Xeon 6330, based on the new Ice Lake SP microarchitecture. The AMD parts are in the same price range as Intel’s chip, at $2010 and $1565 versus $1894. The 16-core chip mostly matches the performance of its 28-core competitor in many workloads while still showcasing a large per-thread performance advantage, while the 24-core part, being 6% more expensive, showcases both a large +26% throughput and a large +47% per-thread performance leadership. Database workloads are admittedly still AMD’s weakness here, but in every other scenario, it’s clear which is the better value proposition.