18 Comments
SarahKerrigan - Monday, June 28, 2021 - link
That looks like a really, really well-balanced accelerator.
(I also appreciate that they didn't skimp on L3 like a lot of Neoverse designs have.)
Wilco1 - Monday, June 28, 2021 - link
The top model beats Graviton 2 on SPECINT due to having *twice* the performance per core. It also has twice as much cache and memory bandwidth per core. And all that at 60W...
It's a monster and likely outperforms most servers it will be offloading!
SarahKerrigan - Monday, June 28, 2021 - link
At least for LLC, it's 4x the cache per core. Octeon10 is 2MB/core, Grav2 is 512KB/core.
mode_13h - Monday, June 28, 2021 - link
For stateful packet processing, they really need as much cache as possible. The amount of context they can hold on chip can become a serious limiting factor. Because there's no SMT, the core is just sitting idle if you have to go off-chip for some connection-specific state.
mode_13h - Monday, June 28, 2021 - link
...and the usual trick of masking it with hardware prefetchers won't work at all, because they can't know which connection the next packet belongs to.
brucethemoose - Friday, July 2, 2021 - link
Isn't Marvell the one who made a 4 or 8-way SMT ARM core, then abandoned it, presumably because it was too niche?
This seems like a perfect use case. A shame it had to go...
mode_13h - Saturday, July 3, 2021 - link
Seems that Cavium's Thunder X2 and X3 had SMT-4. And yes, they did get bought by Marvell.
At that point, you could really reduce your OoO window and still probably get good utilization. Packet processing is one of those embarrassingly parallel problems, so there should be no problem with workload scaling (or side-channel attacks, for that matter).
Oh well. Perhaps somebody else might have a go at it. Maybe we'll start to see some SMT RISC-V cores, especially now that Linux has "core scheduling" to better manage SMT side-channel vulnerabilities.
Since this appears to be in a proprietary core, even Marvell could reverse course and drop in an SMT-based solution, at some point.
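To put rough numbers on the utilization argument, here is a back-of-the-envelope model in Python. It is only a sketch under loud assumptions, not anything taken from Marvell or Cavium: the name `core_utilization` and the 50 ns / 100 ns figures are hypothetical, thread switching is treated as free, and stalls from different hardware threads are assumed to overlap perfectly.

```python
def core_utilization(threads: int, compute_ns: float, stall_ns: float) -> float:
    """Fraction of time the core does useful work.

    Idealized model: each hardware thread alternates between compute_ns of
    work and stall_ns of waiting on off-chip state, and the core can switch
    to any ready thread at zero cost.
    """
    if threads < 1:
        raise ValueError("need at least one hardware thread")
    if compute_ns <= 0 or stall_ns < 0:
        raise ValueError("compute_ns must be positive, stall_ns non-negative")
    # One thread keeps the core busy for compute_ns out of every
    # (compute_ns + stall_ns). Extra threads fill the stall time until the
    # core saturates at 100%.
    return min(1.0, threads * compute_ns / (compute_ns + stall_ns))


if __name__ == "__main__":
    # Hypothetical per-packet costs: 50 ns of work, 100 ns waiting on DRAM.
    for n in (1, 2, 4):
        print(f"SMT-{n}: {core_utilization(n, 50.0, 100.0):.0%} busy")
```

With these made-up numbers, a single thread leaves the core idle most of the time, while four threads are already enough to saturate it in this model. Real cores would land lower, since the threads also compete for cache and execution resources.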
mode_13h - Monday, June 28, 2021 - link
> Nvidia’s BlueField3 DPU design that still “only” features Cortex-A78 cores
I didn't believe this, given its projected launch date, but it turns out to be right in the GTC 2021 keynote slides!
https://images.anandtech.com/doci/16611/17056937.j...
eastcoast_pete - Tuesday, June 29, 2021 - link
Interesting units, especially for 5G base stations and networking. Notice how they emphasize "fanless" operation in the slides! Curious how those compare (if at all) with Intel's x86-based offerings with some ML thrown in? Also, how do these compare with whatever Huawei has or had before they were booted from TSMC's advanced nodes?
mode_13h - Tuesday, June 29, 2021 - link
Should be: https://en.wikipedia.org/wiki/HiSilicon#Kunpeng_93...
For more, see: https://fuse.wikichip.org/news/2274/huawei-expands...
I don't know if any of this stuff is (still) accurate.
mode_13h - Friday, July 2, 2021 - link
I'm not entirely sold on the concept of vector packet processing. I wonder if they really wouldn't just be better off with >= 4-way SMT.
brucethemoose - Friday, July 2, 2021 - link
Would security be a concern with SMT?
For whatever reason, SMT seems to be unpopular in the ARM ecosystem, as even Marvell themselves abandoned the SMT heavy ThunderX3.
In fact, wasn't the TX2 processor based on ThunderX2, which was also an SMT4 design?
mode_13h - Saturday, July 3, 2021 - link
> Would security be a concern with SMT?
I was just thinking about this. For some applications, no. This would tend to be running a highly-managed software stack. However, the nice thing about such an architecture is that you could run guest VMs and other sorts of software with higher likelihood of being malicious or exploitable to behave maliciously.
To help manage these risks, Linux now offers better policy control over which threads can share cores. So, you could limit core-sharing to threads of the same process or VM, for instance.
> For whatever reason, SMT seems to be unpopular in the ARM ecosystem,
Because ARM cores are traditionally comparatively small, the area-efficiency of SMT has been less.
ARM, itself, makes two SMT-2 cores (A65AE & A76AE), for 64-bit embedded applications. This is an implicit acknowledgement of the technical advantages of SMT. Embedded use-cases tend to be the ones with the least risk from side-channel attacks.
> as even Marvell themselves abandoned the SMT heavy ThunderX3.
I think that was simply because they weren't competitive with ARM's N2 cores.
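For anyone curious what the core-sharing policy looks like from userspace, here is a minimal Python sketch of the Linux core-scheduling `prctl` interface. The constants are copied from the kernel's `<linux/prctl.h>` (available in mainline from 5.14 on, as far as I know). The helper name `isolate_process_on_core` is just a hypothetical name for illustration, and the call simply fails on kernels built without `CONFIG_SCHED_CORE`.

```python
import ctypes
import os
import sys

# Values from the Linux UAPI header <linux/prctl.h>.
PR_SCHED_CORE = 62
PR_SCHED_CORE_CREATE = 1              # create a new, unique core-sched cookie
PR_SCHED_CORE_SCOPE_THREAD_GROUP = 1  # apply it to every thread of the process


def isolate_process_on_core() -> bool:
    """Ask the kernel to tag the calling process with its own cookie.

    Tasks with different cookies are not run at the same time on SMT
    siblings of one core. Returns True on success, False if the platform
    or kernel does not support the request.
    """
    if not sys.platform.startswith("linux"):
        return False
    try:
        libc = ctypes.CDLL(None, use_errno=True)
        prctl = libc.prctl
    except (OSError, AttributeError):
        return False
    rc = prctl(
        ctypes.c_int(PR_SCHED_CORE),
        ctypes.c_ulong(PR_SCHED_CORE_CREATE),
        ctypes.c_ulong(0),  # pid 0 means the calling task
        ctypes.c_ulong(PR_SCHED_CORE_SCOPE_THREAD_GROUP),
        ctypes.c_ulong(0),  # last argument must be 0 for CREATE
    )
    if rc != 0:
        err = ctypes.get_errno()
        print(f"core scheduling unavailable: {os.strerror(err)}", file=sys.stderr)
        return False
    return True


if __name__ == "__main__":
    print("cookie created" if isolate_process_on_core() else "not isolated")
```

A hypervisor or container runtime would presumably do something like this once per VM or tenant, so that only threads sharing a cookie ever occupy sibling hardware threads at the same time.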
mode_13h - Saturday, July 3, 2021 - link
> ... the area-efficiency of SMT has been less.
I meant the benefit in area-efficiency vs. simply adding more cores.
Also, I think the raft of recent side-channel vulnerabilities has given SMT an image problem and reduced customer demand for the feature.
ChrisGX - Sunday, July 4, 2021 - link
I don't think Split-Lock capability in the Cortex-A76AE relies on SMT. Dual Core Lock-Step as the name suggests is a way of engaging two cores to raise the reliability of operations running on these specialised computing and control units.
mode_13h - Sunday, July 4, 2021 - link
The split-lock functionality seems distinct from the SMT capability.
https://www.anandtech.com/show/13727/arm-announces...
I'm not certain the A76AE is SMT-capable, however. That might've been some bad info I found.
ChrisGX - Monday, July 5, 2021 - link
Actually, I recall there was a second core besides the Cortex-A65AE from ARM with SMT - the Neoverse E1. Andrei pointed out that the E1 was derived from the Cortex-A65AE. At the time of the release of the E1 core ARM had thought it would be used for "throughput workloads that largely are...about shifting large amounts of data around" and that "are predominantly in the data plane". The Cortex-A65AE was said to be suited to streaming data from sensors whereas the E1 could support the streaming of data from the network in the case of infrastructure workloads. Evidently, with compute capability having become essential to DPUs - that is shown clearly by the Octeon 10 - the E1 may have been eclipsed in the role it was expected to play by ARM's N series Neoverse silicon.
The SMT-capable A65 core still seems interesting to me. It wouldn't shock me to see it (or something very much like it) put to good use beyond Automotive applications in more mainstream Cortex parts.
https://www.anandtech.com/show/13959/arm-announces...
mode_13h - Wednesday, July 7, 2021 - link
Cool, I had forgotten about the E1. Thanks for the follow-up.
This page includes a roadmap slide showing the E1, N2, and V1 all falling off a cliff labeled "Poseidon Generation", in 2022+. So, who knows if there'll be an E2 or whether it'll have any relation to the E1...
https://www.anandtech.com/show/16640/arm-announces...