Comments Locked

120 Comments


  • Memo.Ray - Monday, March 15, 2021 - link

    Thanks for the excellent review, team Anand!
  • ballsystemlord - Monday, March 15, 2021 - link

    Actually, it was Ian and Andrei, not Anand who did the review. I'm not joking, he's a real person who used to work on this site: https://www.anandtech.com/print/1635/
  • velanapontinha - Monday, March 15, 2021 - link

    He was not addressing Anand. He is addressing "team Anand", which obviously means "the people actually working on Anandtech".
  • ballsystemlord - Monday, March 15, 2021 - link

    I was thinking of a one-person (Anand) team. ;)
    I should have gotten what he said though.
  • Gothmoth - Monday, March 15, 2021 - link

    Team Anand! Since when is a team a single person....
  • bigboxes - Tuesday, March 16, 2021 - link

    He didn't get that. Must have missed the "team"
  • plonk420 - Monday, March 22, 2021 - link

    yeah, he was a bit too excited to ACKCHYUALLY someone else
  • Sharma_Ji - Wednesday, March 17, 2021 - link

    Who used to work - lol

    He is the co-founder lol
  • herozeros - Thursday, March 18, 2021 - link

    definitely didn't think comment lurking would make me feel old today. lol
  • Memo.Ray - Monday, March 15, 2021 - link

    I know my English is a "lil" rough around the edges but I gotta admit, I didn't see these comments coming! lol.
  • Calin - Tuesday, March 16, 2021 - link

    That's "team Anandtech" but from the times when the site was done by a single person.
    I used to read Tom's Hardware, then it and Anandtech (sometime before 2000 I think, the site was started in 1997), then Anandtech only.
    Lots of quality hardware information, even though at first the site sometime covered antiviruses and some other software.
    Considering the entire history, reviewers other than Anand are a recent phenomenon :)
  • MenhirMike - Monday, March 15, 2021 - link

    I wonder if AMD is going to add 120 W CPUs again - EPYC Rome had four CPUs with only four memory channels' worth of bandwidth but a lower TDP, including the EPYC 7282.
  • zanon - Monday, March 15, 2021 - link

    They do have an EPYC Embedded (3000 series) line that's still Zen 1. Maybe they'll move that to Zen 3 and that's where the low TDP stuff will go?
  • Foeketijn - Monday, March 15, 2021 - link

    Yes, it's a shame those types of parts haven't really gotten attention yet. It's really great that you can get 128 cores and 256 threads in a 2U server, but if you just need 20 VMs running on a super stable platform, 16 threads and 50 watts are more than enough.
  • Spunjji - Friday, March 19, 2021 - link

    I believe they're leaving that segment to Rome
  • powerarmour - Monday, March 15, 2021 - link

    ARM is going to be the tech to watch in this space IMHO, especially with NVIDIA's upcoming weight behind it.
  • TheinsanegamerN - Monday, March 15, 2021 - link

    2014 called and wants its prediction back.
  • powerarmour - Monday, March 15, 2021 - link

    Ampere Altra responded to the call, but is currently engaged.
  • MenhirMike - Monday, March 15, 2021 - link

    It's not as egregious as "Linux on the Desktop": ARM on the server is actually gaining a foothold, especially for cloud-hosted companies.

    Though x86-64 will be around for a LONG time - ARM might (and likely will) get a nice market share, but it will not seriously threaten x86-64 for decades, if ever.

    One thing that ARM is sorely lacking is workstations to test stuff on. The Ampere eMAG was based on ancient hardware, the Raspberry Pi isn't specced anywhere near the same, and I'm not putting an Ampere Altra on my desk.
  • MenhirMike - Monday, March 15, 2021 - link

    Ampere Altra *server*, that is. I'd love to get a system with the CPU, but priced in the realm of "let's tinker with it and try it out", along with "let's not cool it with 15,000+ rpm 40 mm fans".
  • kgardas - Monday, March 15, 2021 - link

    Avantek offers a workstation as a quieter solution: https://www.avantek.co.uk/ampere-emag-arm-workstat... -- I'll leave the price options to you...
  • MenhirMike - Tuesday, March 16, 2021 - link

    Yeah, but that Avantek is old tech: https://www.anandtech.com/show/15733/ampere-emag-s...
  • Calin - Tuesday, March 16, 2021 - link

    "ARM on the server is actually gaining foothold"
    They have won some niches and are expanding from there.
    I don't think they have enough fab capacity to build all the processors they could sell (especially as AMD is capacity-limited and Intel is - apparently - yield limited).
  • Spunjji - Friday, March 19, 2021 - link

    In the intervening 7 years, it has only become more obvious as an eventuality. Unless you're denying the existence of AWS' serious investment into that ecosystem...
  • Wilco1 - Sunday, March 21, 2021 - link

    Yes, Graviton is already 14% of AWS and still growing fast.
  • prisonerX - Monday, March 15, 2021 - link

    The ARM prediction is probably good, but not with NVIDIA - the acquisition is unlikely to be approved.
  • Crazyeyeskillah - Monday, March 15, 2021 - link

    Nvidia will have no impact on ARM improvements. They merely seek to take Intel and AMD out of the equation by pairing custom ARM servers with their GPUs.
  • Yojimbo - Monday, March 15, 2021 - link

    NVIDIA can have servers with custom ARM chips without buying ARM.
  • Yojimbo - Monday, March 15, 2021 - link

    And by pointing this out I mean that NVIDIA have no intention of taking Intel or AMD out of the equation. They want their GPUs to be used anywhere with any CPU. The problem is Intel and AMD potentially taking NVIDIA's GPUs out of the equation.
  • mode_13h - Monday, March 15, 2021 - link

    Please don't paint Nvidia as a victim. They are not. All of these guys will have to support each other, for the foreseeable future, and for purely pragmatic reasons.
  • Oxford Guy - Monday, March 15, 2021 - link

    They are not 'guys'. They're corporations. Corporations were invented to, to quote Ambrose Bierce, grant 'individual profit without individual responsibility'.
  • mode_13h - Wednesday, March 17, 2021 - link

    No disagreement, but I'm slightly disheartened you decided to take issue with my use of the term "guys". I'll try harder, next time--just for you.
  • Oxford Guy - Tuesday, April 6, 2021 - link

    People humanize corporations all the time. It doesn't lead to good outcomes for societies.

    Of course, it's questionable whether corporations lead to good outcomes, considering that they're founded on scamming people (profit being 'sell less for more', needing tricks to get people to agree to that).
  • chavv - Monday, March 15, 2021 - link

    Is it possible to add another "benchmark" - ESX server workload?
    Like, running 8-16-32-64 VMs all with some workload...
  • Andrei Frumusanu - Monday, March 15, 2021 - link

    As we're rebuilding our server test suite, I'll be looking into more diverse benchmarks to include. It's a long process that needs a lot of thought and possibly resources, so it's not always easy to achieve.
  • eva02langley - Monday, March 15, 2021 - link

    Just buy EPYC and start your hybridization and your reliance on a SINGLE supplier...
  • eva02langley - Monday, March 15, 2021 - link

    edit: Just buy EPYC and start your hybridization and STOP your reliance on a SINGLE supplier...
  • mode_13h - Monday, March 15, 2021 - link

    You guys should really include some workloads involving multiple <= 16-core/32-thread VMs, which could highlight the performance advantages of NPS4 mode. Even if all you did was partition the system into smaller VMs running multithreaded SPEC 2017 tests, at least that would be *something*.

    That said, please don't get rid of all system-wide multithreaded tests, because we definitely still want to see how well these systems scale (both single- and multi- CPU).
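    A minimal sketch (hypothetical, using libnuma - not anything from the article) of how a per-quadrant NPS4 test could pin its memory and threads to one NUMA node:

        #include <numa.h>   /* link with -lnuma */
        #include <stdio.h>

        int main(void) {
            if (numa_available() < 0) {
                fprintf(stderr, "no NUMA support\n");
                return 1;
            }
            int node = 0;                 /* one NPS4 quadrant */
            size_t sz = (size_t)1 << 30;  /* 1 GiB test buffer */
            /* Allocate the buffer on the quadrant's local memory... */
            double *buf = numa_alloc_onnode(sz, node);
            /* ...and restrict the current thread to that node's cores. */
            numa_run_on_node(node);
            /* run the multithreaded benchmark kernel against buf here */
            numa_free(buf, sz);
            return 0;
        }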
  • ishould - Monday, March 15, 2021 - link

    Yes this seems more useful for my needs as well. We use a grid system for job submission and not all cores will be hammered at the same time
  • nonoverclock - Monday, March 15, 2021 - link

    When do we think this will be available to order? Also wondering the same about Ice Lake SP availability, but it seems hard to know for sure.
  • SarahKerrigan - Monday, March 15, 2021 - link

    Looks decent, though the price and TDP increases make it look less appealing at the high end than it otherwise would. Perks of reusing the same process for two generations, I suppose.

    Going to be a very interesting compare against Altra Max.
  • plb4333 - Monday, March 15, 2021 - link

    wouldn't even have to be compared to the 'max' necessarily. Altra without the max is still a contender.
  • Wilco1 - Sunday, March 21, 2021 - link

    Absolutely, Milan and Altra are almost exactly as fast on SPECINT (Altra wins 1S, Milan wins 2S, both by ~1%). Altra Max will give a clear answer as to whether it is better to have 128 threads or 128 cores.
  • ECC_or_GTFO - Monday, March 15, 2021 - link

    Why won't AMD let us secure boot their CPUs? There is simply no valid argument except hiding backdoors at this point.
  • JfromImaginstuff - Monday, March 15, 2021 - link

    Well most Linux distros do not do well with secure boot and that is what is running on most severe these days
  • JfromImaginstuff - Monday, March 15, 2021 - link

    *servers these days
  • Bob Todd - Tuesday, March 30, 2021 - link

    All the enterprise distros support secure boot, so that isn't really a factor (RHEL, SLES, Ubuntu, Debian, etc.). It doesn't matter that random pet projects with 1 or 2 contributors don't support it in this context.
  • Oxford Guy - Monday, March 15, 2021 - link

    I assume EPYC contains AMD's extra black box CPU. Can those with large-enough wallets get that functionality excised, as China reportedly did for the Zen 1 tech deal?
  • mode_13h - Wednesday, March 17, 2021 - link

    It's supposedly ARM TrustZone, right?
  • Oxford Guy - Tuesday, April 6, 2021 - link

    PSP, as far as I know.
  • Linustechtips12#6900xt - Monday, March 15, 2021 - link

    I understand that the "Zen" architecture is for x86, but with modifications could it be transplanted to the ARM instruction set? As I see it, it definitely could, so the real question is when the transition will really start - I think around the theoretical Zen 5th or 6th gen. There's gonna be a lot of ARM around here, especially with Apple. And yes, it will definitely start with servers; it always does.
  • Gomez Addams - Monday, March 15, 2021 - link

    There are really two things at work: the instruction set of the processor and its topology. AMD has been improving both quite a bit. The instruction set enhancements won't transfer so well to ARM, but the topology certainly can. Since ARM cores are much smaller, they could probably work in chiplets with possibly 32 cores each, or maybe 16 cores and 4-way SMT. That could make for a very impressive server processor. Four chiplets would give 64 cores and 256 threads. Yikes!
  • rahvin - Monday, March 15, 2021 - link

    So much wrong.
  • mode_13h - Monday, March 15, 2021 - link

    There are pieces of it that can be reused (on the same manufacturing node, at least), but making a truly-competitive ARM chip is probably going to involve some serious tinkering with the pipeline stages & architecture. And there are significant parts of an x86 chip that you'd have to throw out and redo, most notably the instruction decoder.

    In all, it's a different core that you're talking about. Not like CPU vs. GPU level of difference, but it's a lot more than just cosmetics.
  • coder543 - Monday, March 15, 2021 - link

    "For this launch, both the 16-core F and 24-core F have the same TDP, so the only reason I can think of for AMD to have a higher price on the 16-core processor is that it only has 2 cores per chiplet active, rather than three? Perhaps it is easier to bin a processor with an even number of cores active."

    If I were to speculate, I would strongly guess that the actual reason is licensing. AMD knows that more people are going to want the 16 core CPUs in order to fit into certain brackets of software licensing, so AMD charges more for those to maximize profit and availability of the 16 core parts. For those customers, moving to a 24 core processor would probably mean paying *significantly* more for whatever software they're licensing.
  • SarahKerrigan - Monday, March 15, 2021 - link

    Yep.

    Intel sold quad-core Xeon E7's for impressively high prices for a similar reason.
  • Mikewind Dale - Monday, March 15, 2021 - link

    Why couldn't you run a 16 core software license on a 24 core CPU? I run a 4 core licensed version of Stata MP on an 8 core Ryzen just fine.
  • Ithaqua - Monday, March 15, 2021 - link

    Compliance and lawsuits.
    You have to pay for all the cores you use for some software.

    Yes, if you're only running 4 cores on your 8-core Ryzen then you're fine, but if Stata MP is using all 8, there could be a lawsuit.

    Now for you, I'm sure they wouldn't care. For a larger firm with 10,000+ machines, that's going to be a big lawsuit.
  • arashi - Wednesday, March 17, 2021 - link

    Some licenses charge for ALL cores, regardless of how many cores you would actually be using.
  • Casper42 - Monday, March 15, 2021 - link

    I'd really like to see you all test a 7543 to compare against the 75F3.
    If the per-thread performance (page 8) of that chip can beat the 7713, it might be a great option for VMware environments where folks want to stick to a single license per socket without needing the beastly 75F3.
  • Casper42 - Monday, March 15, 2021 - link

    PS: I think it will also help come April, and I hope you test multiple 32-core offerings then too.
  • Olaf van der Spek - Monday, March 15, 2021 - link

    Why don't these parts boost to 4.5 - 5 GHz when using only one or two cores like the desktop parts?
  • ishould - Monday, March 15, 2021 - link

    Hoping to get an answer to this too
  • Calin - Tuesday, March 16, 2021 - link

    Basically, if you have three servers at 50% load, you shut one off and now deliver power to only two servers running at 75% load.
    An idle server will consume 100+ watts (high idle power is not an issue for server farms), so by running two servers at 75% versus three at 50% you basically save 100 watts.
    (In many cases, server farms are actually power-limited - i.e. limited by electrical energy delivery or cooling.)
  • coschizza - Monday, March 15, 2021 - link

    stability
  • Jon Tseng - Monday, March 15, 2021 - link

    Probably something to do with thermals + reliability - recall that in the datacenter there's a bunch of server blades stuffed into racks. Plus they're running 24/7. Plus the cooling system generally isn't as robust as on a desktop (it costs electricity to run). Bottom line is that server parts tend to run at lower clocks than desktop parts for a mix of all of these reasons.
  • Targon - Monday, March 15, 2021 - link

    Server processors are NOT workstation processors; they are not intended for tiny workloads where there might only be a few things going on at one time. If you want more cores but want to use the machine like a workstation, you go Threadripper.
  • yeeeeman - Monday, March 15, 2021 - link

    quite underwhelming tbh..
  • ballsystemlord - Monday, March 15, 2021 - link

    You expected? AMD has been overwhelming for years now, give them some slack. They can't do it every year.
  • eva02langley - Monday, March 15, 2021 - link

    You're probably looking at the blue lines (Intel)... just saying...
  • Targon - Monday, March 15, 2021 - link

    Compared to what? The core count isn't increasing, but Zen 3 is still a big improvement when it comes to IPC compared to Zen 2.
  • mode_13h - Monday, March 15, 2021 - link

    We can hope that they find some microcode fixes to improve power allocation, and maybe a mid-cycle refresh with an updated I/O die.
  • Spunjji - Friday, March 19, 2021 - link

    How surprising, an Intel fanboy is unimpressed.
  • Wilco1 - Sunday, March 21, 2021 - link

    It's actually an impressive improvement. However Milan is getting power and memory bandwidth limited. It will take a new process and DDR5 to achieve significantly more performance.
  • ballsystemlord - Monday, March 15, 2021 - link

    Spelling and grammar errors:

    "As the first generation Naples was launched, it offered impressive some performance numbers."
    Rearrange words:
    "As the first generation Naples was launched, it offered some impressive performance numbers."

    "All of these processors can be use in dual socket configurations."
    "used" not "use":
    "All of these processors can be used in dual socket configurations."

    "... I see these to chips as the better apples-to-apples generational comparison, ..."
    "two" not "to":
    "... I see these two chips as the better apples-to-apples generational comparison, ..."

    "There is always room for improvement, but if AMD equip themselves with a good IO update next generation,..."
    Missing "s":
    "There is always room for improvement, but if AMD equips themselves with a good IO update next generation,..."
  • eva02langley - Monday, March 15, 2021 - link

    If businesses don't buy EPYC by then, then they deserve all the issues that come with Intel CPUs.
  • Otritus - Monday, March 15, 2021 - link

    Milan's IO die really seems to be the Achilles' heel of these CPUs. Perhaps AMD should have split the lineup into parts with the Milan IO die (superior memory performance and features) and parts with the Rome IO die (superior compute performance, but fewer features).
  • Targon - Monday, March 15, 2021 - link

    The Zen 4 generation will make the move to DDR5 memory, so a new memory controller, socket, and other aspects. Also, as time goes on, the contracts with GlobalFoundries for how much they make for AMD will expire. As it stands now, the use of GlobalFoundries is entirely to fulfill those contracts and avoid paying early termination fees.
  • Calin - Tuesday, March 16, 2021 - link

    TSMC still cannot make enough chiplets (I think its production is sold out until 2023).
    Using GlobalFoundries IO dies means AMD can build an 8+1 processor (eight compute chiplets plus one IO die) out of only eight TSMC dies (or a 4+1 out of four).
  • lejeczek - Monday, March 15, 2021 - link

    But those Altra Q80-33 ... gee guys. I have been thinking for a while - next upgrade of the stack in the rack might as well be...
  • mode_13h - Monday, March 15, 2021 - link

    Well, if it does well on the benchmarks that align with your workload, then I'd certainly consider at least a single-CPU Altra. IIRC, the multi-CPU interconnect was one of its weak points. You could even go dual-CPU, if you're provisioning VMs that fit on a single CPU (or better yet, just one quadrant).
  • Pinn - Monday, March 15, 2021 - link

    When does this filter to the Threadrippers?
  • mode_13h - Monday, March 15, 2021 - link

    Probably either when demand for the 3000-series Threadrippers starts slipping or if/when the supply of top-binned Zen3 dies ever catches up.

    It would be interesting to see what performance could be extracted from these CPUs, if AMD would raise the power/thermal limit another 100 W. Maybe the 5000-series TR Pro will be our chance to find out!
  • mode_13h - Monday, March 15, 2021 - link

    Someone please remind me why Altra's memory performance is so much stronger. Is it simply down to avoiding the cache write-miss penalty? I'm pretty sure x86 CPUs long ago added store buffers to fix that, but I can't think of any other explanation for that incredible STREAM benchmark discrepancy!
  • Andrei Frumusanu - Monday, March 15, 2021 - link

    It's due to the Neoverse N1 cores being able to dynamically transform arbitrary memory writes into non-temporal write streams instead of doing regular RFO before a write as the x86 systems are currently doing. I explain it more in the Altra review:

    https://www.anandtech.com/show/16315/the-ampere-al...
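    For illustration, this is roughly what the N1 does automatically; on x86 you only get this behavior by explicitly using streaming-store intrinsics (a hedged sketch, not the article's benchmark code; assumes dst is 16-byte aligned):

        #include <emmintrin.h>  /* SSE2 intrinsics */
        #include <stddef.h>

        /* Copy using non-temporal stores: destination cache lines are
           written directly to memory instead of being read-for-ownership
           (RFO) into the cache first. */
        void copy_nt(double *dst, const double *src, size_t n) {
            size_t i = 0;
            for (; i + 2 <= n; i += 2) {
                __m128i v = _mm_loadu_si128((const __m128i *)&src[i]);
                _mm_stream_si128((__m128i *)&dst[i], v);  /* no RFO */
            }
            for (; i < n; i++) dst[i] = src[i];  /* scalar tail */
            _mm_sfence();  /* make the streaming stores globally visible */
        }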
  • mode_13h - Monday, March 15, 2021 - link

    That's more or less what I recall, but do you know it's *truly* emitting non-temporal stores? Those partially bypass some or all of the cache hierarchy (I seem to recall that the Pentium 4 actually just restricted them to one set of the L2 cache). It would seem to me that implausibly deep analysis would be needed for the CPU to determine that the core in question wouldn't access the data before it was replaced. And that's not even to speak of determining whether code running on *other* cores might need it.

    On the other hand, if it simply has enough write-buffering, it could avoid fetching the target cacheline by accumulating enough adjacent stores to determine that the entire cacheline would be overwritten. Of course, the downside would be a tiny bit more write latency, and memory-ordering constraints (esp. for x86) might mean that it'd only work for groups of consecutive stores to consecutive addresses.

    I guess a way to eliminate some of those restrictions would be to observe through analysis of the instruction stream that a group of stores would overwrite the cacheline and then issue an allocation instead of a fetch. Maybe that's what Altra is doing?
  • Andrei Frumusanu - Tuesday, March 16, 2021 - link

    You're over-complicating things. The core simply sees a stream pattern and switches over to nontemporal writes. They can fully saturate the memory controller when doing just pure write patterns.
  • mode_13h - Wednesday, March 17, 2021 - link

    But, do you know they're truly non-temporal writes? As I've tried to explain, there are ways to avoid the write-miss penalty without using true non-temporal writes.

    And how much of that are you inferring vs. basing this on what you've been told from official or unofficial sources?
  • Andrei Frumusanu - Saturday, March 20, 2021 - link

    It's 100% non-temporal writes, confirmed by both hardware tests and architects.
  • mode_13h - Saturday, March 20, 2021 - link

    Okay, thanks for confirming with them.
  • mode_13h - Saturday, March 20, 2021 - link

    It's not the easiest thing to confirm with a test, since you'd have to come along behind the writer and observe that a write that SHOULD still be in cache isn't.
  • CBeddoe - Monday, March 15, 2021 - link

    I'm excited by AMD's continuing design improvements.
    Can't wait to see what happens with the next node shrink. Intel has some catching up to do.
  • Ppietra - Tuesday, March 16, 2021 - link

    Can someone please explain how it is possible that the power consumption of the whole package is so much higher than the power consumption of the actual cores doing the work?
  • Spunjji - Friday, March 19, 2021 - link

    Because the I/O die is running on an older 14nm process and is servicing all of the cores. In a 64-core CPU, the per-core power use of the I/O die is less than 2W. Still too much, of course, but in context not as obscene as it looks when you look at the total power.
  • Elstar - Tuesday, March 16, 2021 - link

    Lest it go unsaid, I really appreciate the "compile a big C++ project" benchmark (i.e. LLVM). Thank you!
  • Spunjji - Tuesday, March 16, 2021 - link

    "To that end, all we have to compare Milan to is Intel’s Cascade Lake Xeon Scalable platform, which was the same platform we compared Rome to."

    Says it all, really. Good work AMD, and cheers to the team for the review!
  • Hifihedgehog - Tuesday, March 16, 2021 - link

    Sysadmin: Ram? Rome?

    AMD: Milan, darling, Milan...
  • Ivan Argentinski - Tuesday, March 16, 2021 - link

    Congrats on going more in-depth on per-core performance! For many enterprise buyers, this is the most (only?) important metric. I do suspect that, in this regard, the 8-core 72F3 will actually be the best 3rd-gen EPYC!

    But to better understand this, we need more tests and per-core comparisons. I would suggest comparing:
    * All current AMD fast/frequency optimized CPUs - EPYC 72F3, 73F3, ...
    * Previous gen AMD fast/frequency CPUs like EPYC 7F32, ...
    * Intel Frequency optimized CPUs like Xeon Gold 6250, 6244, ...

    The only metric that matters is per-core performance under full *sustained* load.

    Exploring the dynamic TDP of AMD EPYC 3rd gen is also an interesting option. For example, I am quite curious about configuring 72F3 with 200W instead of the default 180W.
  • Andrei Frumusanu - Saturday, March 20, 2021 - link

    If we get more SKUs to test, I'll be sure to do so.
  • aryonoco - Tuesday, March 16, 2021 - link

    Thanks for the excellent article Andrei and Ian. Really appreciate your work.

    Just wondering, is Johan no longer involved in server reviews? I'll really miss him.
  • Andrei Frumusanu - Saturday, March 20, 2021 - link

    Johan is no longer part of AT.
  • SanX - Tuesday, March 16, 2021 - link

    In summary, a performance difference of 9 vs 8 (Milan vs Rome) means they are EQUAL. Not a single specific application shows more than that. So much for the many months of hype and blah blah.
  • tyger11 - Tuesday, March 16, 2021 - link

    Okay, now give us the new Zen 3 Threadripper Pro!
  • AusMatt - Wednesday, March 17, 2021 - link

    Page 4 text: "a 255 x 255 matrix" should read: "a 256 x 256 matrix".
  • hmw - Friday, March 19, 2021 - link

    What was the stepping for the Milan CPUs? B0? or B1?
  • mkbosmans - Saturday, March 20, 2021 - link

    These inter-core synchronisation latency plots are slightly misleading, or at least not representative of "real software". By fixing the cache line that is used to the first core in the system and then ping-ponging it between two other cores, you do not measure core-to-core latency, but rather core-to-cacheline-to-core latency, as expressed in the article. This is not how inter-thread communication usually works (in well-designed software).
    Allocating the cache line in the memory local to one of the ping-pong threads would make the plot more informative (although a bit more boring).
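    For reference, the core of such a ping-pong test looks roughly like this (a simplified sketch, not the article's actual harness; core pinning and timing are omitted):

        #include <pthread.h>
        #include <stdatomic.h>
        #include <stdio.h>

        #define ITERS 1000000

        /* The shared flag occupies one cache line; where (and by whom) it
           is allocated is exactly the methodological point under debate. */
        static _Alignas(64) atomic_int flag = 0;

        static void *pinger(void *arg) {
            (void)arg;
            for (int i = 0; i < ITERS; i++) {
                while (atomic_load_explicit(&flag, memory_order_acquire) != 0)
                    ; /* spin until the line comes back */
                atomic_store_explicit(&flag, 1, memory_order_release);
            }
            return NULL;
        }

        static void *ponger(void *arg) {
            (void)arg;
            for (int i = 0; i < ITERS; i++) {
                while (atomic_load_explicit(&flag, memory_order_acquire) != 1)
                    ; /* spin until the line comes back */
                atomic_store_explicit(&flag, 0, memory_order_release);
            }
            return NULL;
        }

        int main(void) {
            /* A real benchmark pins each thread to a specific core (e.g.
               pthread_setaffinity_np) and divides the elapsed time by
               2*ITERS to get the cache line transfer latency. */
            pthread_t a, b;
            pthread_create(&a, NULL, pinger, NULL);
            pthread_create(&b, NULL, ponger, NULL);
            pthread_join(a, NULL);
            pthread_join(b, NULL);
            puts("done");
            return 0;
        }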
  • mode_13h - Saturday, March 20, 2021 - link

    Are you saying a single memory address is used for all combinations of core x core?

    Ultimately, I wonder if it makes any difference which NUMA domain the address is in, for a benchmark like this. Once it's in L1 cache, that's what you're measuring, no matter the physical memory address.

    Also, I take issue with the suggestion that core-to-core communication necessarily involves memory in one of the cores' NUMA domains. A lot of cases where real-world software is impacted by core-to-core latency involve global mutexes and atomic counters that won't necessarily be local to either core.
  • mkbosmans - Saturday, March 20, 2021 - link

    Yes, otherwise the SE quadrant (socket 2 to socket 2 communication) would look identical to the NW quadrant, right?

    It does matter which NUMA node the address is in; this is exactly what is addressed later in the article about Xeon having a better cache coherency protocol, where this is less of an issue.

    From the software side, I was more thinking of HPC applications where a situation of threads exchanging data that is owned by one of them is the norm, e.g. using OpenMP or MPI. That is indeed a different situation from contention on global mutexes.
  • mode_13h - Saturday, March 20, 2021 - link

    How often is MPI used for communication *within* a shared-memory domain? I tend to think of it almost exclusively as a solution for inter-node communication.
  • mkbosmans - Tuesday, March 23, 2021 - link

    Even if you have a nice two-tiered approach implemented in your software - let's say MPI for the distributed-memory parallelization on top of OpenMP for the shared-memory parallelization - it often turns out to be faster to limit the shared-memory threads to a single socket or NUMA domain. So in the case of a 2P EPYC configured as NPS4 you would have 8 MPI ranks per compute node.

    But of course there's plenty of software that has parallelization implemented using MPI only, so you would need a separate process for each core. This is often for legacy reasons, with software that originally targeted only a couple of cores. But with the MPI 3.0 shared memory extension, this can even today be a valid approach to great-performing hybrid (shared/distributed memory) code.
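    A minimal sketch of that MPI 3.0 shared-memory approach (illustrative only, assuming an MPI 3.0+ implementation): ranks on the same node allocate one shared segment and read each other's slots with plain loads.

        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);

            /* Group the ranks that can share memory (i.e. on this node). */
            MPI_Comm node;
            MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                                MPI_INFO_NULL, &node);
            int rank;
            MPI_Comm_rank(node, &rank);

            /* Each rank contributes one double to a contiguous segment. */
            double *mine;
            MPI_Win win;
            MPI_Win_allocate_shared(sizeof(double), sizeof(double),
                                    MPI_INFO_NULL, node, &mine, &win);

            MPI_Win_lock_all(0, win);  /* passive epoch for load/store use */
            *mine = (double)rank;      /* write own slot */
            MPI_Win_sync(win);         /* flush, synchronize, flush again */
            MPI_Barrier(node);
            MPI_Win_sync(win);

            /* Read rank 0's slot directly through shared memory. */
            MPI_Aint size; int disp; double *peer;
            MPI_Win_shared_query(win, 0, &size, &disp, &peer);
            printf("rank %d reads %.0f from rank 0\n", rank, *peer);
            MPI_Win_unlock_all(win);

            MPI_Win_free(&win);
            MPI_Finalize();
            return 0;
        }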
  • mode_13h - Tuesday, March 23, 2021 - link

    Nice explanation. Thanks for following up!
  • Andrei Frumusanu - Saturday, March 20, 2021 - link

    This is vastly incorrect and misleading.

    The fact that I'm using a cache line spawned on a third main thread which does nothing with it is irrelevant to the real-world comparison, because from the hardware perspective the CPU doesn't know which thread owns it - in the test the hardware just sees two cores using that cache line; the third main thread becomes completely irrelevant in the discussion.

    The thing that is guaranteed with the main starter thread allocating the synchronisation cache line is that it remains static across the measurements. One doesn't actually have control over where this cache line ends up within the coherent domain of the whole CPU; it's going to end up in a specific L3 cache slice depending on the CPU's address hash positioning. The method here simply keeps that positioning always the same.

    There is no such thing as core-core latency, because cores do not snoop each other directly; they go over the coherency domain, which is the L3 or the interconnect. It's always core-to-cacheline-to-core, as anything else doesn't even exist from the hardware perspective.
  • mkbosmans - Saturday, March 20, 2021 - link

    The original thread may have nothing to do with it, but the NUMA domain where the cache line was originally allocated certainly does. How would you otherwise explain the difference between the first quadrant for socket 1 to socket 1 communication and the fourth quadrant for socket 2 to socket 2 communication?

    Your explanation about address hashing to determine the L3 cache slice maybe makes sense when talking about fixing the initial thread within an L3 domain, but not why you'd want that L3 domain fixed to the first one in the system, regardless of the placement of the two threads doing the ping-ponging.

    And about core-core latency, you are of course right; that is sloppy wording on my part. What I meant to convey is that the roundtrip latency between core-cacheline-core and back is more relevant (at least for HPC applications) when the cache line is local to one of the cores and not remote, possibly even on another socket than the two threads.
  • Andrei Frumusanu - Saturday, March 20, 2021 - link

    I don't get your point - don't look at the intra-remote socket figures then if that doesn't interest you - these systems are still able to work in a single NUMA node across both sockets, so it's still pretty valid in terms of how things work.

    I'm not fixing it to a given L3 in the system (except for that socket); binding a thread doesn't tell the hardware to somehow stick that cache line there forever - software has zero say in that. As you see in the results, it's able to move around between the different L3s and CCXs. Intel moves (or mirrors) it around between sockets and NUMA domains, so your premise isn't correct in that case either; AMD currently can't, probably because they don't have a way to decide most recent ownership between two remote CCXs.

    People may want to just look at the local socket numbers if they prioritise that, the test method here merely just exposes further more complicated scenarios which I find interesting as they showcase fundamental cache coherency differences between the platforms.
  • mkbosmans - Tuesday, March 23, 2021 - link

    For a quick overview of how cores are related to each other (with an allocation local to one of the cores), I like this way of visualizing it more:
    http://bosmans.ch/share/naples-core-latency.png
    Here you can for example clearly see how the four dies of the two sockets are connected pairwise.

    The plots from the article are interesting in that they show the vast difference between the cc protocols of AMD and Intel. And the numbers from the Naples plot I've linked can be mostly gotten from the more elaborate plots from the article, although it is not entirely clear to me how to exactly extend the data to form my style of plots. That's why I prefer to measure the data I'm interested in directly and plot that.
  • imaskar - Monday, March 29, 2021 - link

    Looking at the shares sinking, this pricing was a miss...
  • mode_13h - Tuesday, March 30, 2021 - link

    Prices are a lot easier to lower than to raise. And as long as they can sell all their production allocation, the price won't have been too high.
  • Zone98 - Friday, April 23, 2021 - link

    Great work! However I'm not getting why in the c2c matrix cores 62 and 74 wouldn't have a ~90ns latency as in the NW socket. Could you clarify how the test works?
  • node55 - Tuesday, April 27, 2021 - link

    Why are the CPUs not consistent?

    Why do you switch between the 7713 and 7763 on Milan and the 7662 and 7742 on Rome?

    Why do you not have results for all the server CPUs? This confuses the comparison of e.g. the 7662 vs the 7713. (My current buying decision.)
