Actually, it was Ian and Andrei, not Anand, who did the review. I'm not joking, he's a real person who used to work on this site: https://www.anandtech.com/print/1635/
That's "team Anandtech" but from the times when the site was done by a single person. I used to read Tom's Hardware, then it and Anandtech (sometime before 2000 I think, the site was started in 1997), then Anandtech only. Lots of quality hardware information, even though at first the site sometimes covered antiviruses and some other software. Considering the entire history, reviewers other than Anand are a recent phenomenon :)
I wonder if AMD is going to add 120 W CPUs again - EPYC Rome had 4 CPUs with only 4 memory channels of bandwidth, but with a lower TDP, including the EPYC 7282.
Yes, it's a shame those types of parts haven't really gotten attention yet. It's really great that you can get 128 cores and 256 threads in a 2U server, but if you just need 20 VMs running on a super stable platform, 16 threads and 50 watts are more than enough.
It's not as egregious as "Linux on the Desktop": ARM on the server is actually gaining foothold, especially for Cloud-hosted companies.
Though x86-64 will be around for a LONG time - ARM might (and likely will) gain a nice market share, but it will not seriously threaten x86-64 for decades, if ever.
One thing that ARM is sorely lacking is workstations to test stuff on. The Ampere eMag was based on ancient hardware, the Raspberry Pi isn't specced nearly the same, and I'm not putting an Ampere Altra on my desk.
Ampere Altra *Server* that is. I'd love to get a system with the CPU, but priced in the realm of "Let's tinker with it and try it out" along with "Let's not cool it with 15000+rpm 40mm fans".
"ARM on the server is actually gaining foothold" They have won some niches and are expanding from there. I don't think they have enough fab capacity to build all the processors they could sell (especially as AMD is capacity-limited and Intel is - apparently - yield limited).
In the intervening 7 years, it has only become more obvious as an eventuality. Unless you're denying the existence of AWS' serious investment into that ecosystem...
Nvidia will have no impact on ARM improvements. They merely seek to take Intel and AMD out of the equation by pairing custom ARM servers with their GPUs.
And by pointing this out I mean that NVIDIA have no intention of taking Intel or AMD out of the equation. They want their GPUs to be used anywhere with any CPU. The problem is Intel and AMD potentially taking NVIDIA's GPUs out of the equation.
Please don't paint Nvidia as a victim. They are not. All of these guys will have to support each other, for the foreseeable future, and for purely pragmatic reasons.
They are not 'guys'. They're corporations. Corporations were invented to, to quote Ambrose Bierce, grant 'individual profit without individual responsibility'.
People humanize corporations all the time. It doesn't lead to good outcomes for societies.
Of course, it's questionable whether corporations lead to good outcomes, considering that they're founded on scamming people (profit being 'sell less for more', needing tricks to get people to agree to that).
As we're rebuilding our server test suite, I'll be looking into more diverse benchmarks to include. It's a long process that needs a lot of thought and possibly resources, so it's not always easy to achieve.
You guys should really include some workloads involving multiple <= 16-core/32-thread VMs, that could highlight the performance advantages of NPS4 mode. Even if all you did was partition up the system into smaller VMs running multithreaded SPEC 2017 tests, at least that would be *something*.
That said, please don't get rid of all system-wide multithreaded tests, because we definitely still want to see how well these systems scale (both single- and multi- CPU).
Looks decent, though the price and TDP increases make it look less appealing at the high end than it otherwise would. Perks of reusing the same process for two generations, I suppose.
Going to be a very interesting compare against Altra Max.
Absolutely, Milan and Altra are almost exactly as fast on SPECINT (Altra wins 1S, Milan wins 2S, both by ~1%). Altra Max will give a clear answer as to whether it is better to have 128 threads or 128 cores.
All the enterprise distros support secure boot so that isn’t really a factor (RHEL, SEL, Ubuntu, Debian, etc.). It doesn’t matter that random pet projects with 1 or 2 contributors don’t support it in this context.
I assume EPYC contains AMD's extra black box CPU. Can those with large-enough wallets get that functionality excised, as China reportedly did for the Zen 1 tech deal?
I understand that the "Zen" architecture is for x86, but with modifications could it be transplanted to the ARM instruction set? As I see it, it definitely could, so the real question is when the transition will really start - I think around the theoretical Zen 5th or 6th gen. There's gonna be a lot of ARM around here, especially with Apple. And yes, it will definitely start with servers, it always does.
There are really two things at work: the instruction set of the processor and its topology. AMD has been improving both quite a bit. The instruction set enhancements won't transfer quite so well to ARM but the topology certainly can. Since ARM processors are much smaller, they could probably work in chiplets with possibly 32 cores in each or maybe 16 cores and 4-way SMT. That could make for a very impressive server processor. Four chiplets would give 64 cores and 256 threads. Yikes!
There are pieces of it that can be reused (on the same manufacturing node, at least), but making a truly-competitive ARM chip is probably going to involve some serious tinkering with the pipeline stages & architecture. And there are significant parts of an x86 chip that you'd have to throw out and redo, most notably the instruction decoder.
In all, it's a different core that you're talking about. Not like CPU vs. GPU level of difference, but it's a lot more than just cosmetics.
"For this launch, both the 16-core F and 24-core F have the same TDP, so the only reason I can think of for AMD to have a higher price on the 16-core processor is that it only has 2 cores per chiplet active, rather than three? Perhaps it is easier to bin a processor with an even number of cores active."
If I were to speculate, I would strongly guess that the actual reason is licensing. AMD knows that more people are going to want the 16 core CPUs in order to fit into certain brackets of software licensing, so AMD charges more for those to maximize profit and availability of the 16 core parts. For those customers, moving to a 24 core processor would probably mean paying *significantly* more for whatever software they're licensing.
I'd really like to see you all test a 7543 to compare against the 75F3. If the Per Thread performance (Page 8) of that chip can beat the 7713, it might be a great option for VMware environments where folks want to stick to a single license/socket without needing the beastly 75F3
Basically, if you have three servers at 50% load, you shut one off and now deliver power to only two servers running at 75% load. An idle server will consume 100+ watts (as high idle power is not an issue for server farms) - so by running two servers at 75% versus three at 50%, you basically save 100 watts. (In many cases, server farms are actually power - i.e. electrical energy delivery or cooling - limited.)
Probably something to do with thermals + reliability - recall that in the datacenter there's a bunch of server blades stuffed into racks. Plus they are running 24/7. Plus the cooling system isn't generally as robust as on a desktop (it costs electricity to run). Bottom line is that server parts tend to run at lower clocks than desktop parts for a mix of all of these reasons.
Server processors are NOT workstations; they are not intended for tiny workloads where there might only be a few things going on at one time. If you want more cores but want to use the machine like a workstation, you go Threadripper.
It's actually an impressive improvement. However Milan is getting power and memory bandwidth limited. It will take a new process and DDR5 to achieve significantly more performance.
"As the first generation Naples was launched, it offered impressive some performance numbers." Rearrange words: "As the first generation Naples was launched, it offered some impressive performance numbers."
"All of these processors can be use in dual socket configurations." "used" not "use": "All of these processors can be used in dual socket configurations."
"... I see these to chips as the better apples-to-apples generational comparison, ..." "two" not "to": "... I see these two chips as the better apples-to-apples generational comparison, ..."
"There is always room for improvement, but if AMD equip themselves with a good IO update next generation,..." Missing "s": "There is always room for improvement, but if AMD equips themselves with a good IO update next generation,..."
Milan's IO die really seems to be the Achilles' heel of these CPUs. Perhaps AMD should have segregated the lineup: superior memory performance and features with the Milan IO die, and superior compute performance (but inferior features) with the Rome IO die.
The Zen4 generation will make the move to DDR5 memory, so new memory controller, socket, and other aspects. Also, as time goes on, the contracts with Global Foundries for how much they make for AMD will expire. As it stands now, the use of Global is entirely to fulfill the contracts and avoid paying any early termination fees.
TSMC still cannot make enough chiplets (I think its production is sold out until 2023). Using GlobalFoundries IO dies means AMD can make an 8+1 processor out of 8 TSMC dies instead of 9 (or a 4+1 out of 4).
Well, if it does well on the benchmarks that align with your workload, then I'd certainly consider at least a single-CPU Altra. IIRC, the multi-CPU interconnect was one of its weak points. You could even go dual-CPU, if you're provisioning VMs that fit on a single CPU (or better yet, just one quadrant).
Probably either when demand for the 3000-series Threadrippers starts slipping or if/when the supply of top-binned Zen3 dies ever catches up.
It would be interesting to see what performance could be extracted from these CPUs, if AMD would raise the power/thermal limit another 100 W. Maybe the 5000-series TR Pro will be our chance to find out!
Someone please remind me why Altra's memory performance is so much stronger. Is it simply down to avoiding the cache write-miss penalty? I'm pretty sure x86 CPUs long-ago added store buffers to fix that, but I can't think of any other explanation for that incredible stream benchmark discrepancy!
It's due to the Neoverse N1 cores being able to dynamically transform arbitrary memory writes into non-temporal write streams instead of doing regular RFO before a write as the x86 systems are currently doing. I explain it more in the Altra review:
That's more or less what I recall, but do you know it's *truly* emitting non-temporal stores? Those partially bypass some or all of the cache hierarchy (I seem to recall that the Pentium 4 actually just restricted them to one set of L2 cache). It would seem to me that implausibly deep analysis would be needed for the CPU to determine that the core in question wouldn't access the data before it was replaced. And that's not even to speak of determining whether code running on *other* cores might need it.
On the other hand, if it simply has enough write-buffering, it could avoid fetching the target cacheline by accumulating enough adjacent stores to determine that the entire cacheline would be overwritten. Of course, the downside would be a tiny bit more write latency, and memory-ordering constraints (esp. for x86) might mean that it'd only work for groups of consecutive stores to consecutive addresses.
I guess a way to eliminate some of those restrictions would be to observe through analysis of the instruction stream that a group of stores would overwrite the cacheline and then issue an allocation instead of a fetch. Maybe that's what Altra is doing?
You're over-complicating things. The core simply sees a stream pattern and switches over to nontemporal writes. They can fully saturate the memory controller when doing just pure write patterns.
But, do you know they're truly non-temporal writes? As I've tried to explain, there are ways to avoid the write-miss penalty without using true non-temporal writes.
And how much of that are you inferring vs. basing this on what you've been told from official or unofficial sources?
It's not the easiest thing to confirm with a test, since you'd have to come along behind the writer and observe that a write that SHOULD still be in cache isn't.
Can someone please explain how it is possible that the power consumption of the whole package is so much higher than the power consumption of the actual cores doing the work?
Because the I/O die is running on an older 14nm process and is servicing all of the cores. In a 64-core CPU, the per-core power use of the I/O die is less than 2W. Still too much, of course, but in context not as obscene as it looks when you look at the total power.
Congrats on going more in-depth on per-core performance! For many enterprise buyers, this is the most (only?) important metric. I do suspect that, in this regard, the 8-core 72F3 will actually be the best 3rd gen EPYC!
But to better understand this, we need more tests and per-core comparisons. I would suggest comparing:
* All current AMD fast/frequency optimized CPUs - EPYC 72F3, 73F3, ...
* Previous gen AMD fast/frequency CPUs like EPYC 7F32, ...
* Intel frequency optimized CPUs like Xeon Gold 6250, 6244, ...
The only metric that matters is per-core performance under full *sustained* load.
Exploring the dynamic TDP of AMD EPYC 3rd gen is also an interesting option. For example, I am quite curious about configuring 72F3 with 200W instead of the default 180W.
In summary, a performance difference of 9 vs 8 (Milan vs Rome) means they are EQUAL. Not a single specific application shows more than that. So much for the many months of hype and blahblah.
These inter-core synchronisation latency plots are slightly misleading, or at least not representative of "real software". By fixing the cache line that is used to the first core in the system and then ping-ponging it between two other cores, you do not measure core-core latency, but rather core-to-cacheline-to-core latency, as expressed in the article. This is not how inter-thread communication usually works (in well-designed software). Allocating the cache line in memory local to one of the ping-pong threads would make the plot more informative (although a bit more boring).
Are you saying a single memory address is used for all combinations of core x core?
Ultimately, I wonder if it makes any difference which NUMA domain the address is in, for a benchmark like this. Once it's in L1 cache, that's what you're measuring, no matter the physical memory address.
Also, I take issue with the suggestion that core-to-core communication necessarily involves memory in one of the cores' NUMA domains. A lot of cases where real-world software is impacted by core-to-core latency involve global mutexes and atomic counters that won't necessarily be local to either core.
Yes, otherwise the SE quadrant (socket 2 to socket 2 communication) would look identical to the NW quadrant, right?
It does matter which NUMA node the address is in; this is exactly what is addressed later in the article about Xeon having a better cache coherency protocol, where this is less of an issue.
From the software side, I was more thinking of HPC applications where a situation of threads exchanging data that is owned by one of them is the norm, e.g. using OpenMP or MPI. That is indeed a different situation from contention on global mutexes.
How often is MPI used for communication *within* a shared-memory domain? I tend to think of it almost exclusively as a solution for inter-node communication.
Even if you have a nice two-tiered approach implemented in your software - say, MPI for the distributed memory parallelization on top of OpenMP for the shared memory parallelization - it often turns out to be faster to limit the shared memory threads to a single socket or NUMA domain. So in the case of a 2P EPYC configured as NPS4, you would have 8 MPI ranks per compute node.
But of course there's plenty of software that has parallelization implemented using MPI only, so you would need a separate process for each core. This is often for legacy reasons, with software that was originally targeting only a couple of cores. But with the MPI 3.0 shared memory extension, this can even today be a valid approach to great-performing hybrid (shared/distributed mem) code.
The fact that I'm using a cache line spawned on a third main thread which does nothing with it is irrelevant to the real-world comparison because from the hardware perspective the CPU doesn't know which thread owns it - in the test the hardware just sees two cores using that cache line, the third main thread becomes completely irrelevant in the discussion.
The thing that is guaranteed with the main starter thread allocating the synchronisation cache line is that it remains static across the measurements. One doesn't actually have control over where this cache line ends up within the coherent domain of the whole CPU; it's going to end up in a specific L3 cache slice depending on the CPU's address hash positioning. The method here simply maintains that positioning to be always the same.
There is no such thing as core-core latency because cores do not snoop each other directly, they go over the coherency domain which is the L3 or the interconnect. It's always core-to-cacheline-to-core, as anything else doesn't even exist from the hardware perspective.
The original thread may have nothing to do with it, but the NUMA domain where the cache line was originally allocated certainly does. How would you otherwise explain the difference between the first quadrant for socket 1 to socket 1 communication and the fourth quadrant for socket 2 to socket 2 communication?
Your explanation about address hashing to determine the L3 cache slice maybe makes sense when talking about fixing the initial thread within an L3 domain, but not why you want that L3 domain fixed to the first one in the system, regardless of the placement of the two threads doing the ping-ponging.
And about core-core latency, you are of course right; that is sloppy wording on my part. What I meant to convey is that the roundtrip latency between core-cacheline-core and back is more relevant (at least for HPC applications) when the cacheline is local to one of the cores and not remote, possibly even on another socket than the two threads.
I don't get your point - don't look at the intra-remote socket figures then if that doesn't interest you - these systems are still able to work in a single NUMA node across both sockets, so it's still pretty valid in terms of how things work.
I'm not fixing it to a given L3 in the system (except for that socket); binding a thread doesn't tell the hardware to somehow stick that cacheline there forever - software has zero say in that. As you see in the results, it's able to move around between the different L3s and CCXs. Intel moves (or mirrors) it around between sockets and NUMA domains, so your premise there also isn't correct in that case; AMD currently can't, probably because they don't have a way to decide most recent ownership between two remote CCXs.
People may want to just look at the local socket numbers if they prioritise that, the test method here merely just exposes further more complicated scenarios which I find interesting as they showcase fundamental cache coherency differences between the platforms.
For a quick overview of how cores are related to each other (with an allocation local to one of the cores), I like this way of visualizing it more: http://bosmans.ch/share/naples-core-latency.png Here you can for example clearly see how the four dies of the two sockets are connected pairwise.
The plots from the article are interesting in that they show the vast difference between the cc protocols of AMD and Intel. And the numbers from the Naples plot I've linked can be mostly gotten from the more elaborate plots from the article, although it is not entirely clear to me how to exactly extend the data to form my style of plots. That's why I prefer to measure the data I'm interested in directly and plot that.
Great work! However I'm not getting why in the c2c matrix cores 62 and 74 wouldn't have a ~90ns latency as in the NW socket. Could you clarify how the test works?
Memo.Ray - Monday, March 15, 2021 - link
Thanks for the excellent review team Anand!
velanapontinha - Monday, March 15, 2021 - link
He was not addressing Anand. He is addressing "team Anand", which obviously means "the people actually working on Anandtech"
ballsystemlord - Monday, March 15, 2021 - link
I was thinking of a one-person (Anand) team. ;)
I should have gotten what he said though.
Gothmoth - Monday, March 15, 2021 - link
Team Anand! Since when is a team a single person....
bigboxes - Tuesday, March 16, 2021 - link
He didn't get that. Must have missed the "team"
plonk420 - Monday, March 22, 2021 - link
yeah, he was a bit too excited to ACKCHYUALLY someone else
Sharma_Ji - Wednesday, March 17, 2021 - link
Who used to work - lol
He is the co founder lol
herozeros - Thursday, March 18, 2021 - link
definitely didn't think comment lurking would make me feel old today. lol
Memo.Ray - Monday, March 15, 2021 - link
I know my English is a "lil" rough on the edges but I gotta admit, I didn't see these comments coming! lol.
gustavowoltmann - Saturday, March 27, 2021 - link
https://www.anandtech.com/
zanon - Monday, March 15, 2021 - link
They do have an EPYC Embedded (3000 series) line that's still Zen 1. Maybe they'll move that to Zen 3 and that's where the low TDP stuff will go?
Spunjji - Friday, March 19, 2021 - link
I believe they're leaving that segment to Rome
powerarmour - Monday, March 15, 2021 - link
ARM is going to be the tech to watch in this space IMHO, especially with NVIDIA's upcoming weight behind it.
TheinsanegamerN - Monday, March 15, 2021 - link
2014 called and wants its prediction back.
powerarmour - Monday, March 15, 2021 - link
Ampere Altra responded to the call, but is currently engaged.
kgardas - Monday, March 15, 2021 - link
Avantek provides some workstation as a more silent solution: https://www.avantek.co.uk/ampere-emag-arm-workstat... -- I'll leave price options to you...
MenhirMike - Tuesday, March 16, 2021 - link
Yeah, but that Avantek is old tech: https://www.anandtech.com/show/15733/ampere-emag-s...
Wilco1 - Sunday, March 21, 2021 - link
Yes, Graviton is already 14% of AWS and still growing fast.
prisonerX - Monday, March 15, 2021 - link
ARM prediction is probably good, but not with NVIDIA, they're unlikely to be approved.
Yojimbo - Monday, March 15, 2021 - link
NVIDIA can have servers with custom ARM chips without buying ARM.
mode_13h - Wednesday, March 17, 2021 - link
No disagreement, but I'm slightly disheartened you decided to take issue with my use of the term "guys". I'll try harder, next time--just for you.
chavv - Monday, March 15, 2021 - link
Is it possible to add another "benchmark" - ESX server workload?
Like, running 8-16-32-64 VMs all with some workload...
eva02langley - Monday, March 15, 2021 - link
Just buy EPYC and start your hybridation and your reliance on a SINGLE supplier...
eva02langley - Monday, March 15, 2021 - link
edit: Just buy EPYC and start your hybridation and STOP your reliance on a SINGLE supplier...
ishould - Monday, March 15, 2021 - link
Yes this seems more useful for my needs as well. We use a grid system for job submission and not all cores will be hammered at the same time
nonoverclock - Monday, March 15, 2021 - link
When do we think this will be available to order? Also wondering the same about Ice Lake SP availability but seems it's hard to know for sure.
plb4333 - Monday, March 15, 2021 - link
wouldn't even have to be compared to the 'max' necessarily. Altra without the max is still a contender.
ECC_or_GTFO - Monday, March 15, 2021 - link
Why won't AMD let us secure boot their CPUs? There is simply no valid argument except hiding backdoors at this point.
JfromImaginstuff - Monday, March 15, 2021 - link
Well most Linux distros do not do well with secure boot and that is what is running on most severe these days
JfromImaginstuff - Monday, March 15, 2021 - link
*servers these days
mode_13h - Wednesday, March 17, 2021 - link
It's supposedly ARM TrustZone, right?
Oxford Guy - Tuesday, April 6, 2021 - link
PSP, as far as I know.
rahvin - Monday, March 15, 2021 - link
So much wrong.
coder543 - Monday, March 15, 2021 - link
"For this launch, both the 16-core F and 24-core F have the same TDP, so the only reason I can think of for AMD to have a higher price on the 16-core processor is that it only has 2 cores per chiplet active, rather than three? Perhaps it is easier to bin a processor with an even number of cores active."
If I were to speculate, I would strongly guess that the actual reason is licensing. AMD knows that more people are going to want the 16 core CPUs in order to fit into certain brackets of software licensing, so AMD charges more for those to maximize profit and availability of the 16 core parts. For those customers, moving to a 24 core processor would probably mean paying *significantly* more for whatever software they're licensing.
SarahKerrigan - Monday, March 15, 2021 - link
Yep.
Intel sold quad-core Xeon E7's for impressively high prices for a similar reason.
Mikewind Dale - Monday, March 15, 2021 - link
Why couldn't you run a 16 core software license on a 24 core CPU? I run a 4 core licensed version of Stata MP on an 8 core Ryzen just fine.
Ithaqua - Monday, March 15, 2021 - link
Compliance and lawsuits.
You have to pay for all the cores you use for some software.
Yes, if you're only running 4 cores on your 8 core Ryzen then you're fine, but if Stata MP is using all 8, there could be a lawsuit.
Now for you I'm sure they wouldn't care. For a larger firm with 10,000+ machines, then that's going to be a big lawsuit.
arashi - Wednesday, March 17, 2021 - link
Some licenses charge for ALL cores, regardless of how many cores you would actually be using.
Casper42 - Monday, March 15, 2021 - link
I'd really like to see you all test a 7543 to compare against the 75F3.
If the Per Thread performance (Page 8) of that chip can beat the 7713, it might be a great option for VMware environments where folks want to stick to a single license/socket without needing the beastly 75F3.
Casper42 - Monday, March 15, 2021 - link
PS: I think it will also help come April and I hope you test multiple 32c offerings then too.
Olaf van der Spek - Monday, March 15, 2021 - link
Why don't these parts boost to 4.5 - 5 GHz when using only one or two cores like the desktop parts?
ishould - Monday, March 15, 2021 - link
Hoping to get an answer to this too.
Calin - Tuesday, March 16, 2021 - link
Basically, if you have three servers at 50% load you shut one off and now deliver power to only two servers running at 75% load.
An idle server will consume 100+ watts (as high idle power is not an issue for server farms) - so by running two servers at 75% versus three at 50% you basically save 100 watts.
(in many cases, server farms are actually power - i.e. electrical energy delivery or cooling - limited).
coschizza - Monday, March 15, 2021 - link
stability
Jon Tseng - Monday, March 15, 2021 - link
Probably something to do with thermals + reliability - recall in the datacenter there's a bunch of server blades stuffed into racks. Plus they are running 24/7. Plus the cooling system isn't generally as robust as on a desktop (costs electricity to run). Bottom line is that server parts tend to run at lower clocks than desktop parts for a mix of all of these reasons.
Targon - Monday, March 15, 2021 - link
Server processors are NOT workstations; they are not intended for tiny workloads where there might only be a few things going on at one time. If you want more cores but want to use the machine like a workstation, you go Threadripper.
yeeeeman - Monday, March 15, 2021 - link
quite underwhelming tbh..
ballsystemlord - Monday, March 15, 2021 - link
You expected? AMD has been overwhelming for years now, give them some slack. They can't do it every year.
eva02langley - Monday, March 15, 2021 - link
You're probably looking at the blue lines (Intel)... just saying...
Targon - Monday, March 15, 2021 - link
Compared to what? Core count not increasing, but Zen3 is still a big improvement when it comes to IPC compared to Zen2.
mode_13h - Monday, March 15, 2021 - link
We can hope that they find some microcode fixes to improve power allocation, and maybe a mid-cycle refresh with an updated I/O die.
Spunjji - Friday, March 19, 2021 - link
How surprising, an Intel fanboy is unimpressed.
Wilco1 - Sunday, March 21, 2021 - link
It's actually an impressive improvement. However Milan is getting power and memory bandwidth limited. It will take a new process and DDR5 to achieve significantly more performance.
ballsystemlord - Monday, March 15, 2021 - link
Spelling and grammar errors:
"As the first generation Naples was launched, it offered impressive some performance numbers."
Rearrange words:
"As the first generation Naples was launched, it offered some impressive performance numbers."
"All of these processors can be use in dual socket configurations."
"used" not "use":
"All of these processors can be used in dual socket configurations."
"... I see these to chips as the better apples-to-apples generational comparison, ..."
"two" not "to":
"... I see these two chips as the better apples-to-apples generational comparison, ..."
"There is always room for improvement, but if AMD equip themselves with a good IO update next generation,..."
Missing "s":
"There is always room for improvement, but if AMD equips themselves with a good IO update next generation,..."
eva02langley - Monday, March 15, 2021 - link
If businesses don't buy EPYC by then, then they deserve all the issues coming with Intel CPUs.
Otritus - Monday, March 15, 2021 - link
Milan's IO die really seems to be the Achilles' heel of these CPUs. Perhaps AMD should have segregated the lineup into one with the superior-memory-performance-and-features Milan IO die and one with the superior-compute-performance (but inferior features) Rome IO die.
Targon - Monday, March 15, 2021 - link
The Zen4 generation will make the move to DDR5 memory, so new memory controller, socket, and other aspects. Also, as time goes on, the contracts with Global Foundries for how much they make for AMD will expire. As it stands now, the use of Global is entirely to fulfill the contracts and avoid paying any early termination fees.
Calin - Tuesday, March 16, 2021 - link
TSMC still can not make enough chiplets (I think its production is sold out until 2023).
Using GlobalFoundries IO dies means AMD can make one 8+1 instead of 8 processors (or 4+1 instead of 4).
lejeczek - Monday, March 15, 2021 - link
But those Altra Q80-33 ... gee guys. I have been thinking for a while - next upgrade of the stack in the rack might as well be...
mode_13h - Monday, March 15, 2021 - link
Well, if it does well on the benchmarks that align with your workload, then I'd certainly consider at least a single-CPU Altra. IIRC, the multi-CPU interconnect was one of its weak points. You could even go dual-CPU, if you're provisioning VMs that fit on a single CPU (or better yet, just one quadrant).
Pinn - Monday, March 15, 2021 - link
When does this filter to the Threadrippers?
mode_13h - Monday, March 15, 2021 - link
Probably either when demand for the 3000-series Threadrippers starts slipping or if/when the supply of top-binned Zen3 dies ever catches up.
It would be interesting to see what performance could be extracted from these CPUs, if AMD would raise the power/thermal limit another 100 W. Maybe the 5000-series TR Pro will be our chance to find out!
mode_13h - Monday, March 15, 2021 - link
Someone please remind me why Altra's memory performance is so much stronger. Is it simply down to avoiding the cache write-miss penalty? I'm pretty sure x86 CPUs long-ago added store buffers to fix that, but I can't think of any other explanation for that incredible STREAM benchmark discrepancy!
Andrei Frumusanu - Monday, March 15, 2021 - link
It's due to the Neoverse N1 cores being able to dynamically transform arbitrary memory writes into non-temporal write streams instead of doing a regular RFO before a write as the x86 systems are currently doing. I explain it more in the Altra review:
https://www.anandtech.com/show/16315/the-ampere-al...
mode_13h - Monday, March 15, 2021 - link
That's more or less what I recall, but do you know it's *truly* emitting non-temporal stores? Those partially bypass some or all of the cache hierarchy (I seem to recall that the Pentium 4 actually just restricted them to one set of L2 cache). It would seem to me that implausibly deep analysis would be needed for the CPU to determine that the core in question wouldn't access the data before it was replaced. And that's not even to speak of determining whether code running on *other* cores might need it.
On the other hand, if it simply has enough write-buffering, it could avoid fetching the target cacheline by accumulating enough adjacent stores to determine that the entire cacheline would be overwritten. Of course, the downside would be a tiny bit more write latency, and memory-ordering constraints (esp. for x86) might mean that it'd only work for groups of consecutive stores to consecutive addresses.
I guess a way to eliminate some of those restrictions would be to observe through analysis of the instruction stream that a group of stores would overwrite the cacheline and then issue an allocation instead of a fetch. Maybe that's what Altra is doing?
Andrei Frumusanu - Tuesday, March 16, 2021 - link
You're over-complicating things. The core simply sees a stream pattern and switches over to non-temporal writes. They can fully saturate the memory controller when doing just pure write patterns.
mode_13h - Wednesday, March 17, 2021 - link
But, do you know they're truly non-temporal writes? As I've tried to explain, there are ways to avoid the write-miss penalty without using true non-temporal writes.
And how much of that are you inferring vs. basing this on what you've been told from official or unofficial sources?
Andrei Frumusanu - Saturday, March 20, 2021 - link
It's 100% non-temporal writes, confirmed by both hardware tests and architects.
mode_13h - Saturday, March 20, 2021 - link
Okay, thanks for confirming with them.
mode_13h - Saturday, March 20, 2021 - link
It's not the easiest thing to confirm with a test, since you'd have to come along behind the writer and observe that a write that SHOULD still be in cache isn't.
CBeddoe - Monday, March 15, 2021 - link
I'm excited by AMD's continuing design improvements.
Can't wait to see what happens with the next node shrink. Intel has some catching up to do.
Ppietra - Tuesday, March 16, 2021 - link
Can someone please explain how it is possible that the power consumption of the whole package is so much higher than the power consumption of the actual cores doing the work?
Spunjji - Friday, March 19, 2021 - link
Because the I/O die is running on an older 14nm process and is servicing all of the cores. In a 64-core CPU, the per-core power use of the I/O die is less than 2W. Still too much, of course, but in context not as obscene as it looks when you look at the total power.
Elstar - Tuesday, March 16, 2021 - link
Lest it go unsaid, I really appreciate the "compile a big C++ project" benchmark (i.e. LLVM). Thank you!
Spunjji - Tuesday, March 16, 2021 - link
"To that end, all we have to compare Milan to is Intel's Cascade Lake Xeon Scalable platform, which was the same platform we compared Rome to."
Says it all, really. Good work AMD, and cheers to the team for the review!
Hifihedgehog - Tuesday, March 16, 2021 - link
Sysadmin: Ram? Rome?
AMD: Milan, darling, Milan...
Ivan Argentinski - Tuesday, March 16, 2021 - link
Congrats for going more in-depth on the per-core performance! For many enterprise buyers, this is the most (only?) important metric. I do suspect that, in this regard, the 8 core 72F3 will actually be the best 3rd gen EPYC!
But to better understand this, we need more tests and per-core comparisons. I would suggest comparing:
* All current AMD fast/frequency optimized CPUs - EPYC 72F3, 73F3, ...
* Previous gen AMD fast/frequency CPUs like EPYC 7F32, ...
* Intel Frequency optimized CPUs like Xeon Gold 6250, 6244, ...
The only metric that matters is per-core performance under full *sustained* load.
Exploring the dynamic TDP of AMD EPYC 3rd gen is also an interesting option. For example, I am quite curious about configuring 72F3 with 200W instead of the default 180W.
Andrei Frumusanu - Saturday, March 20, 2021 - link
If we get more SKUs to test, I'll be sure to do so.
aryonoco - Tuesday, March 16, 2021 - link
Thanks for the excellent article, Andrei and Ian. Really appreciate your work.
Just wondering, is Johan no longer involved in server reviews? I'll really miss him.
Andrei Frumusanu - Saturday, March 20, 2021 - link
Johan is no longer part of AT.
SanX - Tuesday, March 16, 2021 - link
In summary, the performance difference of 9 vs 8 (Milan vs Rome) means they are EQUAL. Not a single specific application shows more than that. So much for the many months of hype and blah blah.
tyger11 - Tuesday, March 16, 2021 - link
Okay, now give us the new Zen 3 Threadripper Pro!
AusMatt - Wednesday, March 17, 2021 - link
Page 4 text: "a 255 x 255 matrix" should read: "a 256 x 256 matrix".
hmw - Friday, March 19, 2021 - link
What was the stepping for the Milan CPUs? B0? or B1?
mkbosmans - Saturday, March 20, 2021 - link
These inter-core synchronisation latency plots are slightly misleading, or at least not representative of "real software". By fixing the cache line that is used to the first core in the system and then ping-ponging it between two other cores, you do not measure core-core latency, but rather core-to-cacheline-to-core latency, as expressed in the article. This is not how inter-thread communication usually works (in well-designed software).
Allocating the cache line on the memory local to one of the ping-pong threads would make the plot more informative (although a bit more boring).
mode_13h - Saturday, March 20, 2021 - link
Are you saying a single memory address is used for all combinations of core x core?
Ultimately, I wonder if it makes any difference which NUMA domain the address is in, for a benchmark like this. Once it's in L1 cache, that's what you're measuring, no matter the physical memory address.
Also, I take issue with the suggestion that core-to-core communication necessarily involves memory in one of the cores' NUMA domains. A lot of cases where real-world software is impacted by core-to-core latency involve global mutexes and atomic counters that won't necessarily be local to either core.
mkbosmans - Saturday, March 20, 2021 - link
Yes, otherwise the SE quadrant (socket 2 to socket 2 communication) would look identical to the NW quadrant, right?
It does matter which NUMA node the address is in; this is exactly what is addressed later in the article, about Xeon having a better cache coherency protocol where this is less of an issue.
From the software side, I was more thinking of HPC applications where a situation of threads exchanging data that is owned by one of them is the norm, e.g. using OpenMP or MPI. That is indeed a different situation from contention on global mutexes.
mode_13h - Saturday, March 20, 2021 - link
How often is MPI used for communication *within* a shared-memory domain? I tend to think of it almost exclusively as a solution for inter-node communication.
mkbosmans - Tuesday, March 23, 2021 - link
Even if you have a nice two-tiered approach implemented in your software, let's say MPI for the distributed memory parallelization on top of OpenMP for the shared memory parallelization, it often turns out to be faster to limit the shared memory threads to a single socket or NUMA domain. So in the case of a 2P EPYC configured as NPS4 you would have 8 MPI ranks per compute node.
But of course there's plenty of software that has parallelization implemented using MPI only, so you would need a separate process for each core. This is often because of legacy reasons, with software that was originally targeting only a couple of cores. But with the MPI 3.0 shared memory extension, this can even today be a valid approach to great-performing hybrid (shared/distributed mem) code.
mode_13h - Tuesday, March 23, 2021 - link
Nice explanation. Thanks for following up!
Andrei Frumusanu - Saturday, March 20, 2021 - link
This is vastly incorrect and misleading.
The fact that I'm using a cache line spawned on a third main thread which does nothing with it is irrelevant to the real-world comparison, because from the hardware perspective the CPU doesn't know which thread owns it - in the test, the hardware just sees two cores using that cache line; the third main thread becomes completely irrelevant in the discussion.
The thing that is guaranteed with the main starter thread allocating the synchronisation cache line is that it remains static across the measurements. One doesn't actually have control over where this cache line ends up within the coherent domain of the whole CPU; it's going to end up in a specific L3 cache slice dependent on the CPU's address hash positioning. The method here simply maintains that positioning to be always the same.
There is no such thing as core-core latency because cores do not snoop each other directly, they go over the coherency domain which is the L3 or the interconnect. It's always core-to-cacheline-to-core, as anything else doesn't even exist from the hardware perspective.
mkbosmans - Saturday, March 20, 2021 - link
The original thread may have nothing to do with it, but the NUMA domain where the cache line was originally allocated certainly does. How would you otherwise explain the difference between the first quadrant for socket 1 to socket 1 communication and the fourth quadrant for socket 2 to socket 2 communication?
Your explanation about address hashing to determine the L3 cache slice maybe makes sense when talking about fixing the initial thread within an L3 domain, but not why you want that L3 domain fixed to the first one in the system, regardless of the placement of the two threads doing the ping-ponging.
And about core-core latency, you are of course right; that is sloppy wording on my part. What I meant to convey is that the roundtrip latency between core-cacheline-core and back is more relevant (at least for HPC applications) when the cache line is local to one of the cores and not remote, possibly even on another socket than the two threads.
Andrei Frumusanu - Saturday, March 20, 2021 - link
I don't get your point - don't look at the intra-remote socket figures then, if that doesn't interest you - these systems are still able to work in a single NUMA node across both sockets, so it's still pretty valid in terms of how things work.
I'm not fixing it to a given L3 in the system (except for that socket); binding a thread doesn't tell the hardware to somehow stick that cache line there forever, software has zero say in that. As you see in the results, it's able to move around between the different L3s and CCXs. Intel moves (or mirrors) it around between sockets and NUMA domains, so your premise there also isn't correct in that case; AMD currently can't, because probably they don't have a way to decide most recent ownership between two remote CCXs.
People may want to just look at the local socket numbers if they prioritise that; the test method here merely exposes further, more complicated scenarios which I find interesting, as they showcase fundamental cache coherency differences between the platforms.
mkbosmans - Tuesday, March 23, 2021 - link
For a quick overview of how cores are related to each other (with an allocation local to one of the cores), I like this way of visualizing it more:
http://bosmans.ch/share/naples-core-latency.png
Here you can for example clearly see how the four dies of the two sockets are connected pairwise.
The plots from the article are interesting in that they show the vast difference between the cc protocols of AMD and Intel. And the numbers from the Naples plot I've linked can be mostly gotten from the more elaborate plots from the article, although it is not entirely clear to me how to exactly extend the data to form my style of plots. That's why I prefer to measure the data I'm interested in directly and plot that.
imaskar - Monday, March 29, 2021 - link
Looking at the shares sinking, this pricing was a miss...
mode_13h - Tuesday, March 30, 2021 - link
Prices are a lot easier to lower than to raise. And as long as they can sell all their production allocation, the price won't have been too high.
Zone98 - Friday, April 23, 2021 - link
Great work! However I'm not getting why in the c2c matrix cores 62 and 74 wouldn't have a ~90ns latency as in the NW socket. Could you clarify how the test works?
node55 - Tuesday, April 27, 2021 - link
Why are the CPUs not consistent?
Why do you switch between the 7713 and 7763 on Milan and the 7662 and 7742 on Rome?
Why do you not have results for all the server CPUs? This confuses the comparison of e.g. 7662 vs 7713. (My current buying decision.)