29 Comments
brucethemoose - Wednesday, March 8, 2023 - link
Since the short traces seem to work so well for Apple, should GDDR7 users be thinking about on-package memory? It's not like they are losing modularity by moving away from DIMMs, and AMD is *already* using a multi-chip package.
kabushawarib - Wednesday, March 8, 2023 - link
There's already a solution for GPU memory on the same package: it's called HBM.
A5 - Friday, March 10, 2023 - link
HBM requires an expensive interposer, while on-package GDDR just requires standard trace routing.
mode_13h - Wednesday, March 8, 2023 - link
I think GDDR memory runs a fair bit hotter, making it not so amenable to stacking like what Apple and Nvidia have done to put LPDDR5 in-package. And if you don't stack it, then I think the package would be far too big.
As @kabushawarib said, if you need high bandwidth and want to go in-package, then you probably have no better option than HBM.
III-V - Thursday, March 9, 2023 - link
The dies are too big/numerous to do on package.
kabushawarib - Wednesday, March 8, 2023 - link
I wish for a return of HBM to the consumer scene. It's faster and way more efficient than GDDR.
Ryan Smith - Wednesday, March 8, 2023 - link
I'm right there with you, but it's not in the cards. HBM is a premium memory solution, with a price to match. It's going to remain solely in the domain of high-margin server parts.
evilpaul666 - Thursday, April 13, 2023 - link
Fury X was 2015. Radeon VII was 2019. The only reason we don't see HBM on high-end GPUs, which have 2x'd in price since those cards, is because Nvidia doesn't give a shit about its gaming products beyond milking consumers for Datacenter/AI R&D money.
The 4000 series cards are regressing in memory bandwidth compared to the previous generation. Nvidia are substituting hacky, stretch-and-blur upscaling and "120 Hz TV" frame interpolation for faster hardware and actual rendering improvements, at a higher price.
mode_13h - Wednesday, March 8, 2023 - link
Well, prices on HBM3 have reportedly shot way up, thanks to the AI boom. So, little chance of that happening any time soon.
nandnandnand - Thursday, March 9, 2023 - link
If it was 2-3x and not 5x, it should be a consideration for premium cards. Maybe mixing both HBM and GDDR is possible, e.g. 4 GB + 28 GB.
http://www.businesskorea.co.kr/news/articleView.ht...
HideOut - Thursday, March 9, 2023 - link
So you want two COMPLETELY different memory controllers on the GPU board, and different signal rates and everything? That's even more expensive.
Kevin G - Thursday, March 9, 2023 - link
The board-level complexity wouldn’t change, as HBM is on-package. Yes, there is a cost there, but with the market moving to chiplets anyway at the high end, many of those complexities are already being taken on. One or two low-capacity stacks of HBM would serve as a nice scratch space for a buffer in raster workloads, while GDDRx memory is used for textures.
At the board level, things might become easier if the inclusion of HBM on the package would permit lowering the bus width. In fact, it’d be interesting to see DIMM slots appear on a GPU for memory expansion and holding large data sets. Bonus if such a card supported CXL, so it could actively extend memory capacity on the CPU side.
dotjaz - Friday, March 10, 2023 - link
So basically you understand nothing. It's literally cheaper and MUCH MUCH simpler to just stack up 3D V-Cache than what you've suggested.
Nobody cares about board-level complexity. And it'd NOT be interesting to see DIMMs on a GPU. What's the point? The bandwidth is so abysmal you might as well just use system memory via PCIe 5.0.
Zoolook - Friday, March 17, 2023 - link
3D V-Cache is several orders of magnitude more expensive than HBM (we are talking MB vs. GB here), and I'd question that it's simpler as well.
mode_13h - Saturday, March 11, 2023 - link
> it’d be interesting to see DIMM slots appear on a GPU for
> memory expansion and holding large data sets.
This idea is DOA in the era of CXL. A CXL x16 link is about as fast as 64-bit DDR5-8000. That lets you put your large dataset in main memory or off on some other CXL.mem device. In either case, it could be shared in a cache-coherent way with other GPUs and CPUs in the system.
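As a rough sanity check of that comparison, here is a small sketch with assumed figures (a CXL link on a PCIe 5.0 PHY at 32 GT/s per lane with 128b/130b encoding, versus a single 64-bit DDR5-8000 channel; real CXL.mem efficiency will be somewhat lower):

# Per-direction raw link bandwidth, ignoring CXL/DDR protocol overheads.
PCIE5_GT_PER_LANE = 32            # GT/s per lane on a PCIe 5.0 / CXL 2.0 PHY
ENCODING = 128 / 130              # 128b/130b line coding
LANES = 16
cxl_x16_GBps = PCIE5_GT_PER_LANE * ENCODING * LANES / 8   # ~63 GB/s each way

DDR5_MTps = 8000                  # DDR5-8000 transfer rate
BUS_BYTES = 64 // 8               # one 64-bit wide channel
ddr5_GBps = DDR5_MTps * BUS_BYTES / 1000                  # 64 GB/s

print(f"CXL x16 over PCIe 5.0: ~{cxl_x16_GBps:.0f} GB/s per direction")
print(f"64-bit DDR5-8000: {ddr5_GBps:.0f} GB/s")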
evilpaul666 - Thursday, April 13, 2023 - link
The 3060 Ti seems to support both GDDR6 and GDDR6X fine.
evilpaul666 - Thursday, April 13, 2023 - link
Yeah, because they were doing it before that. 🙄
meacupla - Wednesday, March 8, 2023 - link
GDDR7 sounds nice and all, but unless it is used as unified memory, it feels like it's a bit wasted.
By that, I mean the PS6 and next-gen Xbox, and hopefully the Switch 2.
mode_13h - Wednesday, March 8, 2023 - link
More bandwidth per pin lets you get more performance with a narrower memory bus. In RDNA 2, AMD got by with the top-spec GPU having only 256-bit, but they had to go up to 384-bit for RDNA 3. If memory speeds don't continue to increase, we might see them getting up to 512-bit again. And that's expensive.
So, if you care about perf/$, then it's in your interest to see GDDR speeds continue to increase.
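The trade-off above falls straight out of the basic bandwidth arithmetic; here is a minimal sketch (the 32 Gb/s GDDR7 per-pin rate in the last line is an assumed, illustrative figure, not a shipping spec):

def bandwidth_GBps(bus_width_bits: int, gbps_per_pin: float) -> float:
    # Aggregate bandwidth = bus width (bits) x per-pin data rate (Gb/s) / 8 bits per byte.
    return bus_width_bits * gbps_per_pin / 8

print(bandwidth_GBps(256, 16))  # RDNA 2 Navi 21: 256-bit GDDR6 @ 16 Gb/s -> 512 GB/s
print(bandwidth_GBps(384, 20))  # RDNA 3 Navi 31: 384-bit GDDR6 @ 20 Gb/s -> 960 GB/s
print(bandwidth_GBps(256, 32))  # hypothetical 256-bit GDDR7 @ 32 Gb/s    -> 1024 GB/s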
Samus - Thursday, March 9, 2023 - link
It all comes down to architecture. AMD has historically favored a wider, faster memory bus, while nVidia has historically been less sensitive to the memory subsystem. Most AMD architectures are memory-bandwidth starved, while nVidia architectures are clock (effectively power/current) starved. GCN and Maxwell are two great examples of this: GCN-based Fiji cards brought HBM, while Maxwell brought a large L2 cache with a tremendously reduced memory bus, most cards having 128-bit, 192-bit, 224-bit, or 256-bit, as opposed to AMD having a mainstream 256-bit to 384-bit, with 4096-bit at the high end.
mode_13h - Thursday, March 9, 2023 - link
* I think you're over-generalizing. Don't try to fit trends, but rather look at it case-by-case.
* Maxwell indeed went up to 384-bit, which it needed to counter Fury.
* Even then, Maxwell could only tie Fury by switching to tile-based rendering.
* Fury had 4096-bit, but the memory clock was just 1/6th of its GDDR5-based brethren.
* Vega only had 2048-bit, at a higher clock. Still, lower bandwidth than Fury.
* Vega 20 had 4096-bit, at an even higher clock, but didn't increase shaders or ROPs from Vega.
* Pascal further increased bandwidth efficiency by improving texture compression.
* RDNA 2 @ 256-bit was able to counter RTX 3000 @ 384-bit, by using Infinity Cache.
* RTX 4000 countered by increasing its L2 cache to comparable levels, while maintaining 384-bit.
Zoolook - Friday, March 17, 2023 - link
HBM was also less power-hungry, which AMD needed since their silicon was less efficient than Nvidia's at that time. It was also an early bet on new technology; they were clearly counting on a more rapid uptake of HBM and lower future prices.
HBM might have been a goner without AMD's bet on it, which would have been a shame.
JasonMZW20 - Tuesday, March 14, 2023 - link
I think this just shows how much effort Nvidia put into delta color compression (DCC). By Pascal, Nvidia were able to compress a large majority of the screenspace, saving a ton of bandwidth in the process. AMD was also onboard the DCC train with Fiji/Tonga, and improved it again in Polaris. By Vega 10/Vega 20, it wasn't external memory bandwidth holding these architectures back; rather, it was the registers and local caches at the CUs that were under increasing pressure, as well as poor overall geometry performance in graphics workloads. RDNA sought to fix those issues and did so.
With the explosion of compute performance in high-end GPUs and also ray tracing, memory bandwidth is again a major limitation across the die, usually at the SM or CU level, as they expend registers and local caches processing ever larger datasets. Nvidia would not have dedicated so much die space to a very large 96MB L2 cache (full-die AD102 maximum) if it didn't have significant benefits. Same for AMD in RDNA 2, with the 128MB L3 in Navi 21.
mode_13h - Wednesday, March 15, 2023 - link
> By Vega 10/Vega 20, it wasn't external memory bandwidth holding these architectures back
Um, the gains made by Vega 20 were primarily due to increasing memory bandwidth by more than 2x. That, and a clock speed bump. No microarchitecture changes, though.
But, Vega 20 was not made for gaming. Rather, it was destined for compute workloads, with a full contingent of fp64 and newly-added packed arithmetic instructions targeting AI.
> RDNA sought to fix those issues and did so.
Indeed, it significantly closed the gap between AMD GPUs' on-paper vs. real-world performance.
> With the explosion of compute performance in high-end GPUs and also
> ray tracing, memory bandwidth is again a major limitation across the die
If you just look at RDNA2 vs. RTX 3000, it seems pretty clear that AMD's outsized performance was largely thanks to Infinity Cache. Nvidia's resurgence in RTX 4000 suspiciously coincides with them adding massive amounts of L2 cache. These data points suggest that memory bandwidth is very much a live issue, even for raster performance.
Unfortunately, AMD took a step back on cache size in RDNA3. I hope we'll see a 3D V-Cache version of Navi 31 that steps it up to 192 MB or more. How nuts would it be if they even manage to reuse the very same V-Cache dies they're using in the Ryzen 7000X3D CPUs?
eSyr - Thursday, March 9, 2023 - link
PAM3 enables not 1.5, but rather log_2(3) bits per cycle (which can be rounded down to 1.5, but still); the fact that 1.5 equals 3/2 might mislead the reader as to how the information density is calculated.
DanNeely - Thursday, March 9, 2023 - link
They're only using 8 of the 9 possible symbol pairs, so 1.5, not 1.584..., is the correct value. The 0/0 pair isn't used, presumably because they'd then need something else to distinguish pathological data being transmitted from no signal.
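For concreteness, here is a small sketch of both numbers in this exchange (the three signal levels and the choice of which pair goes unused are illustrative assumptions; only the 8-of-9 counting comes from the comments above):

import math
from itertools import product

# Information-theoretic capacity of one PAM3 symbol: 3 levels -> log2(3) bits.
raw_bits_per_symbol = math.log2(3)                   # ~1.585

# GDDR7-style encoding: 3 bits are carried by a pair of PAM3 symbols,
# i.e. 8 of the 9 possible two-symbol combinations are actually used.
pairs = list(product((-1, 0, +1), repeat=2))         # 9 combinations
used_pairs = 8                                       # one pair left unused
encoded_bits_per_symbol = math.log2(used_pairs) / 2  # 3 bits / 2 symbols = 1.5

print(f"raw capacity: {raw_bits_per_symbol:.3f} bits/symbol")
print(f"encoded (8 of {len(pairs)} pairs): {encoded_bits_per_symbol} bits/symbol")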
Silver5urfer - Thursday, March 9, 2023 - link
Now it makes sense why Nvidia's top-end GA102 cards (RTX 3090 Ti) and AD102 cards utilize ECC. In NVCP you can toggle ECC on or off. HWBot started enforcing it for all the masses, but for their own they allow it. Enabling ECC means less memory performance, dragging down the top scores overall. Ultimately, PAM4 at such high clock/data rates means errors and noise, so ECC was the solution that Nvidia implemented.
G6X is really a beta test vehicle for all of Ampere. In fact, the entire Ampere lineup is a beta: the MLCC reference shenanigans, then the Amazon game burning up cards, then the RTX 3090 VRAM disaster, which Nv says is fine; the datasheets allow 118C, but that's still way too high. The only card worth buying in Ampere is the 3090 Ti, as it uses 2GB modules of Micron G6X and runs at full speed as intended, without all this drama, plus it has ECC; and of course it's not available anymore.
Moving along, I hope Samsung makes a comeback in memory with GDDR7. Micron memory is not really top-end for GDDR. For NAND flash, Samsung is faltering badly (see the 980 Pro firmware issues and 990 Pro NAND wear, while Micron's high-layer-count NAND has a lot more TBW, e.g. the FireCuda 530).
A5 - Friday, March 10, 2023 - link
ECC is an optional feature that NV has available for error-sensitive compute applications. They generally recommend leaving it off (except on Gx100 products with HBM) as it incurs a 10% performance hit.