"Meanwhile, such a major memory bandwidth improvement indicates that the new processor employs a memory subsystem with a higher number of channels compared to Graviton3, though AWS has not formally confirmed this."
"Given that Amazon did not reveal many details about its Graviton4, it is hard to attribute performance increases to any particular characteristics of the CPU.
Yet, NextPlatform believes that the processor uses Arm Neoverse V2 cores, which are more capable than V1 cores used in previous-generation AWS processors when it comes to instruction per clock (IPC). Furthermore, the new CPU is expected to be fabricated using one of TSMC's N4 process technologies (4nm-class), which offers a higher clock-speed potential than TSMC's N5 nodes."
I don't know why the writer is acting like the uarch and memory are some kind of mystery worthy of speculation. Amazon says outright that it's Neoverse V2 and has 12-channel DDR5-5600.
It's great, abstractly, if these are power efficient, but since you can't buy these it only matters to Jeff Bezos. For the rest of us, we only care about the sticker price.
That's why statements like this: "Nowadays many cloud service providers design their own silicon, but Amazon Web Services (AWS) started to do this ahead of its rivals and by now its Annapurna Labs develops processors that can well compete with those from AMD and Intel." Are kind of a huh.
Also, the 30% number seems to refer to the entire CPU. For a CPU with 50% more cores, then switching to the N2 seems in line with having only a 30% speedup. If they meant per-core performance improved by 30%, that's more than I'd expect for the gen-on-gen difference between Neoverse-V cores.
I guess what makes 30% plausible as a single-thread speedup number would be if they increased the clock speed significantly. Give that the V2 allegedly delivers 13% better IPC than V1, that leaves 15% to be accounted for (1.13 * 1.15 = 1.3).
If this is achieved by increasing the clock speed from Graviton 3's 2.6 GHz to 3.0 GHz, then it works out perfectly. However, that's a little at odds with their efficiency goals. NextPlatform instead worked backwards from the claimed power budget to arrive at a clock speed of only 2.7 GHz. If true, there's still about 11% that's unaccounted for. NextPlatform suggests this might've been accomplished via additional cache.
Yes, I saw your link. Thanks for sharing it. The point of my comment was just to explain why Anandtech, NextPlatform, and others tried to speculate - they were working off less information than you found.
One aspect the AWS blog entry doesn't comment on that the Nextplatform one speculates on is that Graviton 4 might be SMT enabled. On a somewhat related note, Ampere is currently dealing with/trying to deal with a limitation for ARM cores in the Linux kernel, that right now can only handle 256 physical cores. While Ampere's new CPU has "only" 192 cores, their new CPU is also capable of dual-socket operation. Except that, right now, doing so would exceed the kernel limit of the very OS that most ARM-based servers use. If Graviton 4 does indeed have SMT, it would sidestep that issue and still allow two socket operation while offering more than 256 threads. Of course, among the first ARM-based server CPUs to use SMT was/is actually the one from Huawei/HiSilicon. They now also use that SMT-capable core in their Smartphone SoC for the Chinese market. AFAIK, right now the only smartphone SoC to have SMT.
Neat, that's nearly 70% the memory bandwidth of a 2021 Mac Studio (M1 Ultra).
I know, it's a facetious comparison, but I find it curious either way you look at it - whether that server CPU makers just can't seem to beat Apple's barely-more-than-a-laptop-CPU, or that Apple so massively over-designed their M-series processors in this regard.
Yes, and no. Of course it is true that GPUs are bandwidth-hungry. But it's also true that data centers are notoriously underprovisioned with bandwidth. Dick Sites (one of the Google performance engineers) has frequently complained about this.
I suspect it costs money to provide bandwidth (not to mention designing your machine differently, eg without using standard motherboards and DIMMs), and that this is part of what you are getting when you pay for Apple (in spite of the loud voices who claim that there are no advantages to on-SoC RAM...)
The GPU comment was made to explain why Apple's client SoC has so much bandwidth. The reason why server CPUs can't easily do the same is due to their memory scalability requirements.
Of course, there are exceptions. Intel's Xeon Max has HBM, which can be used as a "cache", to avoid compromising on scalability. Nvidia's Grace uses on-package LPDDR5X, I guess with a memory scaling strategy of either adding more nodes. Maybe, in the future, they plan on pools of additional memory being accessed over CXL. Long-term, we seem to be headed for memory tiers, where one form or another of on-package memory comprises the fast tier.
Anyway, if you're curious how a server CPU would perform with 1 TB/s of bandwidth:
M-series chips are using LPDDR memories, which are aimed at high bandwidth at the cost of latency. IFRC the latency of M1 is ~100ns and zen3 chips were capable of ~60ns latency. Considering you need to do 1 fetch, 1 decode before any of the bandwidth is usable for numerical calculation.
I'd say LPDDR is more like GDDR rather conventional DDR.
Most of the latency to DRAM is in the traversal through caches and on the NoC. The part that's really specific to DRAM latency is surprisingly similar across different DRAM technologies.
Apple and AMD (and Intel) make different tradeoffs about the latency of this cache traversal and NoC. Apple optimize for low power, and get away with it because their caches are much more sophisticated (hit more often) and they do a better job of hiding DRAM latency within the core.
I think most of the latency is actually the column address time. LPDDR chips runs at lower voltage for higher bus speed, but sacrifice the row buffer charge time. Typical trade off between PPA
> Most of the latency to DRAM is in the traversal through caches and on the NoC.
Are you sure about that? Here, we see the latency of GDDR6 and GDDR6X running in the range of 227 to 269 ns, which is more than a factor of 2 greater than DRAM latency usually runs, even for server CPUs.
On an otherwise idle GPU, I really can't imagine why it would take so long to traverse its cache hierarchy and on-die interconnect. Not only that, but the RTX 4000 GPUs have just 2 levels of cache, in contrast to modern CPUs' 3-level cache hierarchy.
If you look at the paper (and understand it...) you will see that across a wide range of DRAM technologies the (kinda) best case scenario from when a request "leaves the last queue" until when the data is available is about 30ns. The problem is dramatic variation in when "leaving that last queue" occurs. WITHIN DRAM technologies, schemes that provide multiple simultaneously serviced queues can dramatically reduce the queueing delay; across SoC designs, certain schemes can dramatically reduce (or increase) the delay from "execution unit" to "memory controller".
In the case of a GPU, for example, a large part of what the GPU wants to do is aggregate memory requests so that if successive lanes of a warp, or successive warps, reference successive addresses, this can be converted into a single long request to memory. There's no incentive to move requests as fast as possible from execution to DRAM; on the contrary they are kept around for as long as feasible in the hopes that aggregation increases.
Thanks for the paper. I noticed it just cover GDDR5. I tried looking at the GDDR6 spec, but didn't see much to warrant a big change other than the bifurcation of the interface down to 16-bit. Even that doesn't seem like enough to add more than a nanosecond or so per 64B burst.
> There's no incentive to move requests as fast as possible from execution to DRAM; > on the contrary they are kept around for as long as feasible in the hopes that > aggregation increases
That makes sense for writes, but not reads (which is what Chips&Cheese measured). For reads, you'd just grab a cacheline, as that's already bigger than a typical compressed texture block. Furthermore, in rendering, reads tend to be scattered, while writes tend to be fairly coherent. So, write-combining makes sense, but read-combining doesn't.
Also, write latency can (usually) be hidden from software by deep queues, but read latency can't. Even though GPUs' SMT can hide read latency, that depends on having lots of concurrency and games don't always have sufficient shader occupancy to accomplish this. Plus, the more read latency you need to hide, the more warps/wavefronts you need, which translates into needing more registers. So, SMT isn't free -- you don't want more of it than necessary.
...the world's first Armv9-based 2-way server design...
Hmm, maybe dual socket rather than 2-way. I believe there are already a handful of 2-or-more CPU packages for ARM wandering around out there in the wider world that would suffice to meet the 2-way criteria.
I'm pretty sure Nvidia's Grace already achieved that. You can put two of them on a "Superchip" module, as Nvidia is keen to point out in their zeal to quote the largest possible number (i.e. 144 cores and 960 GB/s memory bandwidth - both of which are only possible on a dual-CPU module).
it would be useful to know what they mean by 'database performance'. most RDBMS implementations these days are still COBOL/java code that use the 'database' as just a convenient holder of flat-files (easy to load and backup, mostly), doing endless sequential reads and writes. show me 3NF or 5NF comparative performance?
SarahKerrigan - Wednesday, November 29, 2023 - link
"Meanwhile, such a major memory bandwidth improvement indicates that the new processor employs a memory subsystem with a higher number of channels compared to Graviton3, though AWS has not formally confirmed this.""Given that Amazon did not reveal many details about its Graviton4, it is hard to attribute performance increases to any particular characteristics of the CPU.
Yet, NextPlatform believes that the processor uses Arm Neoverse V2 cores, which are more capable than V1 cores used in previous-generation AWS processors when it comes to instruction per clock (IPC). Furthermore, the new CPU is expected to be fabricated using one of TSMC's N4 process technologies (4nm-class), which offers a higher clock-speed potential than TSMC's N5 nodes."
https://aws.amazon.com/blogs/aws/join-the-preview-...
I don't know why the writer is acting like the uarch and memory are some kind of mystery worthy of speculation. Amazon says outright that it's Neoverse V2 and has 12-channel DDR5-5600.
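For reference, a quick back-of-envelope sketch of what 12-channel DDR5-5600 implies for peak bandwidth; the Graviton3 figures (8-channel DDR5-4800) are the commonly cited configuration, not something stated in this thread:

```python
# Editor's back-of-envelope sketch; assumes 64-bit (8-byte) channels and the
# commonly cited 8-channel DDR5-4800 configuration for Graviton3.
def peak_bw_gbs(channels: int, mt_per_s: int, bytes_per_channel: int = 8) -> float:
    """Peak theoretical bandwidth in GB/s for a DDR-style memory subsystem."""
    return channels * mt_per_s * bytes_per_channel / 1000

graviton4 = peak_bw_gbs(channels=12, mt_per_s=5600)  # ~537.6 GB/s
graviton3 = peak_bw_gbs(channels=8, mt_per_s=4800)   # ~307.2 GB/s
print(f"Graviton4 ~{graviton4:.1f} GB/s, Graviton3 ~{graviton3:.1f} GB/s, "
      f"uplift ~{graviton4 / graviton3 - 1:.0%}")     # roughly +75%
```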
bwj - Wednesday, November 29, 2023 - link
It's great, abstractly, if these are power efficient, but since you can't buy these it only matters to Jeff Bezos. For the rest of us, we only care about the sticker price.

trevor23 - Wednesday, November 29, 2023 - link
People and companies who care about sustainability care about power consumption.

Threska - Wednesday, November 29, 2023 - link
That's why statements like this:

"Nowadays many cloud service providers design their own silicon, but Amazon Web Services (AWS) started to do this ahead of its rivals and by now its Annapurna Labs develops processors that can well compete with those from AMD and Intel."

are kind of a huh.
mode_13h - Thursday, November 30, 2023 - link
> It's great, abstractly, if these are power efficient,

It should translate into lower pricing, because it costs money both to power the CPUs and then remove the waste heat.
mode_13h - Thursday, November 30, 2023 - link
> I don't know why the writer is acting like the uarch and memory are some kind of mystery

Because the link cited by the article is different than yours. It doesn't say what kind of core they have:
https://press.aboutamazon.com/2023/11/aws-unveils-...
Also, the 30% number seems to refer to the entire CPU. For a CPU with 50% more cores, only a 30% overall speedup would seem more in line with switching to the N2. If they meant per-core performance improved by 30%, that's more than I'd expect for the gen-on-gen difference between Neoverse-V cores.
mode_13h - Thursday, November 30, 2023 - link
I guess what makes 30% plausible as a single-thread speedup number would be if they increased the clock speed significantly. Given that the V2 allegedly delivers 13% better IPC than V1, that leaves 15% to be accounted for (1.13 * 1.15 = 1.3).

If this is achieved by increasing the clock speed from Graviton 3's 2.6 GHz to 3.0 GHz, then it works out perfectly. However, that's a little at odds with their efficiency goals. NextPlatform instead worked backwards from the claimed power budget to arrive at a clock speed of only 2.7 GHz. If true, there's still about 11% that's unaccounted for. NextPlatform suggests this might've been accomplished via additional cache.
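As a sketch of the arithmetic in this comment (the 13% IPC figure and the 2.6/2.7/3.0 GHz clocks are the estimates discussed above, not confirmed specifications):

```python
# Editor's sketch of the back-of-envelope math above; the IPC uplift and clock
# speeds are the estimates under discussion, not confirmed Graviton4 specs.
ipc_uplift = 1.13        # claimed V2-over-V1 IPC improvement
target_speedup = 1.30    # claimed single-thread gain over Graviton3
base_clock_ghz = 2.6     # Graviton3 clock speed

# Clock needed if the whole 30% comes from IPC plus frequency:
implied_clock = base_clock_ghz * target_speedup / ipc_uplift
print(f"Implied clock: {implied_clock:.2f} GHz")   # ~2.99 GHz

# If the clock is only 2.7 GHz (NextPlatform's power-budget estimate),
# something else (e.g. extra cache) has to cover the remainder:
residual = target_speedup / (ipc_uplift * 2.7 / base_clock_ghz)
print(f"Unaccounted-for gain at 2.7 GHz: {residual - 1:.1%}")   # ~10.8%
```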
SarahKerrigan - Thursday, November 30, 2023 - link
Right, but I'm not making this stuff up. The link I provided is first-party, from Amazon. It has V2. That's not an area of ambiguity.

mode_13h - Thursday, November 30, 2023 - link
> Right, but I'm not making this stuff up.

Yes, I saw your link. Thanks for sharing it. The point of my comment was just to explain why Anandtech, NextPlatform, and others tried to speculate - they were working off less information than you found.
eastcoast_pete - Sunday, December 3, 2023 - link
One aspect the AWS blog entry doesn't comment on, but the NextPlatform piece speculates about, is that Graviton 4 might be SMT-enabled.

On a somewhat related note, Ampere is currently dealing with (or trying to deal with) a limitation for ARM cores in the Linux kernel, which right now can only handle 256 physical cores. While Ampere's new CPU has "only" 192 cores, it is also capable of dual-socket operation. Except that, right now, doing so would exceed the kernel limit of the very OS that most ARM-based servers use. If Graviton 4 does indeed have SMT, it would sidestep that issue and still allow two-socket operation while offering more than 256 threads.

Of course, among the first ARM-based server CPUs to use SMT was/is actually the one from Huawei/HiSilicon. They now also use that SMT-capable core in their smartphone SoC for the Chinese market. AFAIK, it is right now the only smartphone SoC to have SMT.
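For what it's worth, a tiny sketch of the counting argument, taking the comment's premise that the kernel limit applies to physical cores (core counts as cited above):

```python
# Editor's sketch of the counting argument, taking the comment's premise that
# the Linux kernel limit in question applies to physical cores.
KERNEL_CORE_LIMIT = 256

ampere_dual_socket = 2 * 192        # 384 physical cores -> over the limit
graviton4_dual_socket = 2 * 96      # 192 physical cores -> under the limit
graviton4_smt2_threads = graviton4_dual_socket * 2  # 384 threads, if SMT2 existed

print(ampere_dual_socket > KERNEL_CORE_LIMIT)       # True
print(graviton4_dual_socket <= KERNEL_CORE_LIMIT)   # True
print(graviton4_smt2_threads)                       # 384
```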
mode_13h - Monday, December 4, 2023 - link

I've never heard of Neoverse V2 supporting SMT. You'd think ARM would've mentioned that when they announced it.

> among the first ARM-based server CPUs to use SMT was/is actually
> the one from Huawei/HiSilicon.
There were other ARM cores with SMT. The Cortex-A65 was used in a couple self-driving SoCs and supports 2-way SMT.
In terms of server CPUs, the Neoverse E1 supposedly has it. Cavium/Marvell's ThunderX2 & ThunderX3 have 4-way SMT.
29a - Monday, December 4, 2023 - link
"I don't know why the writer is acting like the uarch and memory are some kind of mystery worthy of speculation."Because he's terrible, AI could write much better articles.
Wadiest - Wednesday, November 29, 2023 - link
Neat, that's nearly 70% of the memory bandwidth of a 2021 Mac Studio (M1 Ultra).

I know it's a facetious comparison, but I find it curious either way you look at it - whether it's that server CPU makers just can't seem to beat Apple's barely-more-than-a-laptop CPU, or that Apple so massively over-designed their M-series processors in this regard.
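As a rough ratio check (both numbers are peak theoretical figures: 12-channel DDR5-5600 for Graviton4 and Apple's advertised 800 GB/s for the M1 Ultra):

```python
# Editor's rough ratio check using peak theoretical figures, not measurements.
graviton4_bw = 12 * 5600 * 8 / 1000   # 12-channel DDR5-5600 -> 537.6 GB/s
m1_ultra_bw = 800.0                   # Apple's advertised M1 Ultra figure

print(f"{graviton4_bw / m1_ultra_bw:.0%}")   # ~67%, i.e. "nearly 70%"
```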
bubblyboo - Wednesday, November 29, 2023 - link
Because M-series processor memory bandwidth is shared between the CPU and GPU, whereas here it's only the CPU bandwidth.

mode_13h - Thursday, November 30, 2023 - link
Yup. Comparing CPU + GPU vs. CPU-only. GPUs are notoriously bandwidth-hungry.

name99 - Friday, December 1, 2023 - link
Yes, and no.

Of course it is true that GPUs are bandwidth-hungry. But it's also true that data centers are notoriously underprovisioned with bandwidth. Dick Sites (one of the Google performance engineers) has frequently complained about this.
I suspect it costs money to provide bandwidth (not to mention designing your machine differently, e.g. without using standard motherboards and DIMMs), and that this is part of what you are getting when you pay for Apple (in spite of the loud voices who claim that there are no advantages to on-SoC RAM...).
mode_13h - Saturday, December 2, 2023 - link
The GPU comment was made to explain why Apple's client SoC has so much bandwidth. Server CPUs can't easily do the same because of their memory scalability requirements.

Of course, there are exceptions. Intel's Xeon Max has HBM, which can be used as a "cache" to avoid compromising on scalability. Nvidia's Grace uses on-package LPDDR5X, I guess with a memory scaling strategy of either adding more nodes or, maybe in the future, pools of additional memory accessed over CXL. Long-term, we seem to be headed for memory tiers, where one form or another of on-package memory comprises the fast tier.
Anyway, if you're curious how a server CPU would perform with 1 TB/s of bandwidth:
https://www.phoronix.com/review/xeon-max-ubuntu-23...
lemurbutton - Wednesday, November 29, 2023 - link
Why are you comparing a server CPU to a consumer SoC?

erinadreno - Wednesday, November 29, 2023 - link
M-series chips are using LPDDR memories, which are aimed at high bandwidth at the cost of latency. IIRC the latency of M1 is ~100ns, while Zen 3 chips were capable of ~60ns latency. And consider that you need to do a fetch and a decode before any of that bandwidth is usable for numerical calculation.

I'd say LPDDR is more like GDDR rather than conventional DDR.
mode_13h - Thursday, November 30, 2023 - link
LPDDR is about low power. You can hit the same bandwidth numbers using DDR5, but at considerably higher power.

TheinsanegamerN - Friday, December 1, 2023 - link
I am not aware of any mass-production DDR5 PCs with a 9600 MHz memory standard. LPDDR, OTOH...

mode_13h - Friday, December 1, 2023 - link
That's not simply LPDDR5, but rather LPDDR5T.

DDR5 has its own variants, like MCR and some similar technique that AMD is pursuing.
name99 - Friday, December 1, 2023 - link
Most of the latency to DRAM is in the traversal through caches and on the NoC. The part that's really specific to DRAM latency is surprisingly similar across different DRAM technologies.

Apple and AMD (and Intel) make different tradeoffs about the latency of this cache traversal and NoC.
Apple optimize for low power, and get away with it because their caches are much more sophisticated (hit more often) and they do a better job of hiding DRAM latency within the core.
erinadreno - Saturday, December 2, 2023 - link
I think most of the latency is actually the column address time. LPDDR chips run at lower voltage for higher bus speed, but sacrifice the row buffer charge time. A typical PPA trade-off.

mode_13h - Saturday, December 2, 2023 - link
> Most of the latency to DRAM is in the traversal through caches and on the NoC.

Are you sure about that? Here, we see the latency of GDDR6 and GDDR6X running in the range of 227 to 269 ns, which is more than a factor of 2 greater than DRAM latency usually runs, even for server CPUs.
https://chipsandcheese.com/2022/11/02/microbenchma...
On an otherwise idle GPU, I really can't imagine why it would take so long to traverse its cache hierarchy and on-die interconnect. Not only that, but the RTX 4000 GPUs have just 2 levels of cache, in contrast to modern CPUs' 3-level cache hierarchy.
name99 - Monday, December 4, 2023 - link
This is covered in https://user.eng.umd.edu/~blj/papers/memsys2018-dr...

If you look at the paper (and understand it...) you will see that across a wide range of DRAM technologies the (kinda) best case scenario from when a request "leaves the last queue" until when the data is available is about 30ns. The problem is dramatic variation in when "leaving that last queue" occurs. WITHIN DRAM technologies, schemes that provide multiple simultaneously serviced queues can dramatically reduce the queueing delay; across SoC designs, certain schemes can dramatically reduce (or increase) the delay from "execution unit" to "memory controller".
In the case of a GPU, for example, a large part of what the GPU wants to do is aggregate memory requests so that if successive lanes of a warp, or successive warps, reference successive addresses, this can be converted into a single long request to memory.
There's no incentive to move requests as fast as possible from execution to DRAM; on the contrary they are kept around for as long as feasible in the hopes that aggregation increases.
mode_13h - Monday, December 4, 2023 - link
Thanks for the paper. I noticed it just covers GDDR5. I tried looking at the GDDR6 spec, but didn't see much to warrant a big change other than the bifurcation of the interface down to 16-bit. Even that doesn't seem like enough to add more than a nanosecond or so per 64B burst.

> There's no incentive to move requests as fast as possible from execution to DRAM;
> on the contrary they are kept around for as long as feasible in the hopes that
> aggregation increases
That makes sense for writes, but not reads (which is what Chips&Cheese measured). For reads, you'd just grab a cacheline, as that's already bigger than a typical compressed texture block. Furthermore, in rendering, reads tend to be scattered, while writes tend to be fairly coherent. So, write-combining makes sense, but read-combining doesn't.
Also, write latency can (usually) be hidden from software by deep queues, but read latency can't. Even though GPUs' SMT can hide read latency, that depends on having lots of concurrency and games don't always have sufficient shader occupancy to accomplish this. Plus, the more read latency you need to hide, the more warps/wavefronts you need, which translates into needing more registers. So, SMT isn't free -- you don't want more of it than necessary.
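To put a number on the register-pressure point, here's a toy occupancy calculation; the register-file size and per-thread register counts are illustrative values, not tied to any particular GPU:

```python
# Editor's toy occupancy calculation for the register-pressure point; the
# register-file size and per-thread register counts are illustrative only.
REGISTER_FILE_PER_SM = 65536   # 32-bit registers available per SM (example value)
THREADS_PER_WARP = 32

def max_resident_warps(regs_per_thread: int) -> int:
    """Warps that can stay resident if every thread needs this many registers."""
    return REGISTER_FILE_PER_SM // (regs_per_thread * THREADS_PER_WARP)

# Hiding more read latency needs more resident warps, which squeezes the
# register budget each thread can use:
for regs in (32, 64, 128):
    print(f"{regs} regs/thread -> up to {max_resident_warps(regs)} resident warps")
```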
dotjaz - Thursday, November 30, 2023 - link
And it only has 53% of the bandwidth of a 2019 mid-to-high-end GPU (Radeon VII).
Again, comparing on-package memory (i.e. HBM2) vs. DDR5 RDIMMs.

Intel's Xeon Max gets the same 1 TB/s as Radeon VII, but it's limited to just 64 GB of HBM.
mode_13h - Thursday, November 30, 2023 - link
Try putting a couple TB of RAM in that Mac Studio.

Oops, that's right! You can't add any RAM at all! You're stuck with just the 128 GB that Apple could fit on package.
The point of servers is they're *scalable*. Same goes for PCIe.
Dante Verizon - Thursday, November 30, 2023 - link
The performance increase was modest considering the massive increase in core count.

PeachNCream - Thursday, November 30, 2023 - link
...the world's first Armv9-based 2-way server design...

Hmm, maybe dual socket rather than 2-way. I believe there are already a handful of 2-or-more CPU packages for ARM wandering around out there in the wider world that would suffice to meet the 2-way criterion.
mode_13h - Thursday, November 30, 2023 - link
> world's first Armv9-based 2-way server design

I'm pretty sure Nvidia's Grace already achieved that. You can put two of them on a "Superchip" module, as Nvidia is keen to point out in their zeal to quote the largest possible number (i.e. 144 cores and 960 GB/s memory bandwidth - both of which are only possible on a dual-CPU module).
Arnulf - Friday, December 1, 2023 - link
"Therefore, to offer 192 vCPU cores, Amazon will need to ... enable three-way SMT on a 96-core CPU"Doesn't compute.
FunBunny2 - Saturday, December 2, 2023 - link
It would be useful to know what they mean by 'database performance'. Most RDBMS implementations these days are still COBOL/Java code that use the 'database' as just a convenient holder of flat files (easy to load and back up, mostly), doing endless sequential reads and writes. Show me 3NF or 5NF comparative performance?