Matthias B V - Monday, November 8, 2021 - link
For me the most interesting part is that they use N6 instead of N5.

Usually those high-priced professional products were the first to make use of new nodes... I really wonder what the reason was, as performance and power consumption should be much better on N5. Yield and performance should be good by now!
- Not enough capacity? I don't think that is an issue, as volumes for those cards are not that extreme.
Matthias B V - Monday, November 8, 2021 - link
I think it must be 3D stacking: stacking on N7/N6 just got into mass production, and they might not be ready to stack on N5 yet...

yeeeeman - Monday, November 8, 2021 - link
Apple is eating everything in the first few months.

Matthias B V - Monday, November 8, 2021 - link
Yeah, Apple was supposed to be massive and first on N5, but N5 is a 2020 node and we are talking about 2022...

Fulljack - Monday, November 8, 2021 - link
Apple still sells its old 2020 devices (iPhone, iPad, Mac family), all of which have processors fabbed on TSMC N5. Including this year's models, they're still using a lot of TSMC N5 capacity.

intelresting - Monday, November 8, 2021 - link
All the MacBooks, and probably the coming Mac Pros, are N5.

Zoolook - Tuesday, November 9, 2021 - link
I saw a figure a while back that Apple has 80% of N5 capacity reserved for 2021.

vlad42 - Monday, November 8, 2021 - link
Actually, the high-priced professional products are typically one of the later product categories to use a new node. These high-priced professional products are usually huge server CPUs, GPUs, or some other accelerator. Enterprises do not want to pay extra for the low yields of such large chips on new processes where the yields can still be significantly improved. The typical order for a new node is cellphone SoCs > laptop/desktop CPUs/APUs/SoCs > consumer dGPUs > high-performance professional products (servers/workstations) > embedded & networking chips.

vlad42 - Monday, November 8, 2021 - link
Also, the big chips just take longer to design and optimize than the smaller chips.

mode_13h - Monday, November 8, 2021 - link
Counterexample: A100 used TSMC N7, while Nvidia's consumer GPUs used Samsung's 8 nm.

vlad42 - Monday, November 8, 2021 - link
First, exceptions do not make the rule; they are outliers. Second, the A100 was released in 2020. Zen 2 was released on the same N7 process in 2019, and it was preceded by cellphone chips from the likes of Qualcomm, etc. Also, much like TSMC's N7, the Samsung 8nm process had already been used for cellphone SoCs for a few years by the time Nvidia launched the Ampere GeForce cards in 2020.

vlad42 - Monday, November 8, 2021 - link
In fact, I remember some people tried to dismiss the efficiency advantage of AMD's RDNA2-based GPUs at the desktop high end because they were on TSMC's N7, which was better than Samsung's 8nm. However, we only knew it was better because we had seen comparisons between cellphone chips built on the two processes.

whatthe123 - Monday, November 8, 2021 - link
??? CDNA is N7, A100 is N7, Rome is N7. CDNA2 is going to be the first time in a while that AMD isn't putting their next-gen product on the high-volume leading node.

vlad42 - Monday, November 8, 2021 - link
The N7 chips in Rome were the same as those in Ryzen, and those were pretty small - it just used a lot of them. The large chip in Rome was based on GlobalFoundries 14/12 nm. The nature of AMD's server CPUs makes it much easier for them to be released earlier compared to traditional monolithic chips. Chiplet designs could definitely change this historical release pattern, because the big CPUs/GPUs/accelerators would just be made up of higher-yielding small chips.

Vega 20 came before CDNA1 and was less than half the size (~330mm2 vs ~750mm2). Vega 20 was admittedly fairly large, but it was extremely low volume. AMD themselves stated it was mostly to work out kinks in the process and in switching to TSMC (AMD was a bit desperate at the time to stay relevant in GPUs due to Vega's bad graphics scaling). TSMC's N7 had already been used in cellphone chips by that time, but not in laptop/desktop-sized chips. So if anything, Vega 20 was an outlier.
whatthe123 - Monday, November 8, 2021 - link
Right, but none of those things changes the fact that AMD moved to the leading high-volume node for next-gen products. There was physically no access to a better high-volume leading node, so what did you expect them to do for those releases, exactly? This time 5nm is available at high volume, but they're specifically choosing to go with 6nm.

intelresting - Monday, November 8, 2021 - link
Good catch.

Wrs - Monday, November 8, 2021 - link
Could be a power density issue. N5 chips from Apple top out at around 0.2 W/mm2. For a GPU, 0.3-0.5 W/mm2 is typical, so maybe N5 isn't quite ready.

KennethAlmquist - Friday, November 12, 2021 - link
One possibility is that AMD chose N6 because they predicted that N6 would have a lower defect rate than N5.

These chips reportedly have 29 billion transistors. Defects are a bigger issue for large chips than for small chips because a defect can force you to discard an entire chip, which wastes a lot more silicon if you are manufacturing big chips than if you are manufacturing small chips.
As a point of reference for how large a 29 billion transistor chip is, a Zen 3 chip is reportedly 4.15 billion transistors. AMD plans to use N5 for Zen 4, which will likely contain somewhat more transistors than Zen 3, but certainly nothing close to 29 billion.
AMD improves the situation by disabling parts of the chip, allowing the chip to be used even if it contains defects. AMD apparently does not expect to get very many defect free chips, because it doesn't offer an SKU with all of the compute units enabled. But even with this strategy, AMD still has to worry about defect rates because some defects will affect circuits that are required for the chip to function at all, and can't be worked around by disabling a single compute unit.
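To put rough numbers on why die size dominates this, here is a minimal sketch using the textbook Poisson yield model, Y = exp(-D0 x A). The defect densities below are illustrative assumptions, not published TSMC figures; only the shape of the comparison matters.

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // Assumed defect densities (defects per cm^2) -- illustrative only, not TSMC data.
    const double d0_mature = 0.07;   // a mature node in the N7/N6 family
    const double d0_newer  = 0.12;   // a younger node such as N5 early in its life

    const double die_sizes_mm2[] = {80.0, 750.0};   // roughly: small CPU chiplet vs. big accelerator die

    for (double mm2 : die_sizes_mm2) {
        const double cm2 = mm2 / 100.0;
        // Poisson model: probability that a die lands with zero defects.
        const double yield_mature = std::exp(-d0_mature * cm2);
        const double yield_newer  = std::exp(-d0_newer  * cm2);
        std::printf("%4.0f mm^2: ~%4.1f%% defect-free (mature) vs ~%4.1f%% (newer)\n",
                    mm2, 100.0 * yield_mature, 100.0 * yield_newer);
    }
    return 0;
}
```

With these made-up inputs, the small chiplet barely notices the difference (~95% vs ~91% defect-free), while the ~750 mm^2 die drops from roughly 59% to 41%, which is exactly why a lower defect rate and salvage SKUs matter so much at this size.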
mode_13h - Saturday, November 13, 2021 - link
> some defects will affect circuits that are required for the chip to function at all

This is why mesh interconnects will become more common, I think.
I wonder if we'll ever see a situation where *extra* components start getting added, like having 7 memory controllers on a chip that can only accommodate 6 memory channels, or something similar for PCIe.
patel21 - Monday, November 8, 2021 - link
"After reading their lowest point":reading -> reaching ??
Ryan Smith - Monday, November 8, 2021 - link
Yep. Thanks!

Silver5urfer - Monday, November 8, 2021 - link
Having the 2 dies show up as 2x GPUs is interesting; I wonder how it will work on the client side with the Radeon RDNA3 cards. The rest is crazy stuff, which is always entertaining to see with these HPC parts.

mode_13h - Monday, November 8, 2021 - link
Graphics has a lot of global data-movement, which is what makes multi-GPU rendering difficult. Scaling consumer GPUs to multiple dies is hard for similar reasons.

In compute workloads, there are at least cases where you have fairly good locality.
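As a concrete illustration of that locality point: a compute job can often be split so that each die (which the host sees as its own device) only ever touches its own slice of the data. A minimal HIP sketch, assuming a ROCm install and at least two visible devices; the kernel is just a stand-in for real work.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;        // each element is touched exactly once, locally
}

int main() {
    int devices = 0;
    hipGetDeviceCount(&devices);         // a dual-die module shows up as two devices here
    if (devices < 2) { std::printf("need at least 2 devices\n"); return 1; }

    const int n = 1 << 24;
    const int half = n / 2;
    std::vector<float*> buf(2, nullptr);

    // Each device gets its own half of the problem: no cross-die traffic at all.
    for (int d = 0; d < 2; ++d) {
        hipSetDevice(d);
        hipMalloc(reinterpret_cast<void**>(&buf[d]), half * sizeof(float));
        scale<<<(half + 255) / 256, 256>>>(buf[d], half, 2.0f);   // launches are asynchronous
    }
    for (int d = 0; d < 2; ++d) {        // wait for both dies, then clean up
        hipSetDevice(d);
        hipDeviceSynchronize();
        hipFree(buf[d]);
    }
    return 0;
}
```

Rendering a frame doesn't decompose this cleanly, which is the multi-GPU graphics problem in a nutshell.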
Targon - Monday, November 8, 2021 - link
If both dies can get access to the same cache/VRAM, it may not be that bad, but you need a front end to decide where different things are executed. It's like the early days of multi-threaded programming, where you have the front end (the main program), which can then spawn multiple threads on different CPU cores. Now, the front end is what the rest of the computer sees/talks to (like the I/O die in Ryzen), and behind that is where the MCM "magic" happens. As long as access to VRAM can be coherent between things that have been split across multiple dies, it should work fairly well. CES is in 2 more months, and we should get information at that point.

whatthe123 - Monday, November 8, 2021 - link
Pooled memory like that is one of the downsides, because now you've got GPUs that are often already memory/bandwidth starved contending for access to the same pool. It's only a boon when you're either not fully utilizing both GPU dies or if you've spent a house payment on cache and HBM.

Kevin G - Tuesday, November 9, 2021 - link
The end game is a tile-based architecture like a checkerboard: red tiles are compute and black tiles are shared memory stacks between their neighbors. The compute tiles can also alternate between CPU, GPU, AI, FPGA, etc. based upon the workload. That is the dream.

Heat, power consumption, cost, and manufacturing a large organic interposer for it all are the limitations on how big such a design can be. I would have thought that these factors would prevent such a design from coming to market, but we have wafer scale chips now which is something thought impossible to do in volume manufacturing.
mode_13h - Wednesday, November 10, 2021 - link
> tile-based architecture like a checkerboard: red tiles are compute
> and black tiles are shared memory stacks between their neighbors.
Not sure about that. I think you'd put the compute tiles in the center, to minimize the cost of cache-coherency. I think that's a higher priority than reducing memory access latency.
> we have wafer scale chips now which is something thought
> impossible to do in volume manufacturing.
...using the term "volume manufacturing" somewhat liberally. Yeah, what they did was cool and I guess it suggests that maybe you could have a chiplet-like approach that didn't involve any sort of inteposer or communication via the substrate.
nandnandnand - Tuesday, November 9, 2021 - link
"or if you've spent a house payment on cache and HBM"Maybe it do be like that. 128 GB HBM here, up to 512 MB Infinity Cache on the consumer side for (expensive) RDNA 3 possibly.
mode_13h - Tuesday, November 9, 2021 - link
> like the I/O die in Ryzen

GPUs have bandwidth 10x that of Ryzen. Putting an I/O die in a GPU would burn too much power for too little gain.
Now, if the I/O die were also a giant chunk of Infinity Cache, then maybe it could offset a large amount of that overhead.
JasonMZW20 - Wednesday, November 10, 2021 - link
Patents and other leaks have already revealed RDNA3's MCD, which links the 2 GCDs. Only 1 GCD is allowed to communicate with the host CPU, so the host only sees the MCM package as 1 GPU, unlike the MI250X we see here. Should be interesting. IIRC, rendering is checkerboard-like, tasking both GCDs with pieces of the rendered frame, with Infinity Cache acting as a global cache pool for shared assets.

The MCD is very likely a dense interconnect fabric with a larger Infinity Cache to keep data on package, as going out to GDDR6 has power and latency penalties.
Targon - Monday, November 8, 2021 - link
For RDNA 3, a key for consumer graphics is that you want it to appear to be a single GPU, even across multiple dies. At that point, something like the I/O die in Ryzen will be a front end, with the multiple GPU dies behind it, "hidden". Shaders and such would then preferably be run on a single die, but could potentially span multiple dies, with some performance hit if that actually happened. On the CPU side, you don't have single threads that span multiple cores, but in graphics, a single programmed shader will run across multiple GPU cores (not dies), so that is where you get very good scaling with more GPU cores. Obviously, the designers at AMD understand this stuff a lot better than I do, so they may have ways to avoid potential slowdowns due to things that span multiple dies.

Kevin G - Tuesday, November 9, 2021 - link
Not really; everything is linked by Infinity Fabric, including the dies in the same package. Putting two into the same package like this is a density play.

One thing worth noting is that a single die still has eight Infinity Fabric links, so bringing a single-die product to market is an option for AMD. That would be a straightforward way to drop power consumption without altering the underlying platform these plug into.

eastcoast_pete - Monday, November 8, 2021 - link
eastcoast_pete - Monday, November 8, 2021 - link
Thanks Ryan! While I am certainly not in the market for one of those, I wonder if there is a chance of a deep(er) dive into the interconnect used?

Unrelated: the choice of N6 makes sense, as TSMC has been pushing its customers to port their N7 designs to N6 for a while now (read: they have capacity and get more chips per wafer vs N7), while N5 is basically hogged/reserved by Apple. Insisting on N5 at TSMC would probably both drastically raise costs and restrict numbers, as AMD would have had to take the leftover capacity Apple isn't using as it becomes available. Going to Samsung (if even possible) wouldn't have gained anything: as we know from one of Andrei's reviews, TSMC's N7 is as efficient as Samsung's "5 nm", and TSMC's N6 is, at minimum, not worse than their N7.
mode_13h - Monday, November 8, 2021 - link
Zen 4 will use N5. So, fab availability might not have been the deciding factor, given that these should be a relatively low-volume product.

Ryan Smith - Monday, November 8, 2021 - link
"I wonder if there is a chance of a deep(er) dive into the interconnect used?"A good chunk of what I already know is in this article. But what else would you like to know?
Kevin G - Tuesday, November 9, 2021 - link
One thing I'd like to know is why it has taken AMD so long to enable CPU-to-GPU links via Infinity Fabric. Vega 20 was the first GPU to have those 25 Gbit links, and it appears as if Rome on the CPU side had them as well. The pieces seemed to be there to do much of this on the previous generation of products; they were just never enabled.

mode_13h - Wednesday, November 10, 2021 - link
Vega 20 used them for an over-the-top connector. And in the Mac Pro, they were used for dual-GPU graphics cards (which I think could *also* accommodate an over-the-top bridge?).

I'm unclear how much benefit there'd be to using those links for host communication, instead of just PCIe 4.0 x16.
mode_13h - Monday, November 8, 2021 - link
* yawn *

Wake me when AMD decides to support ROCm on its consumer GPUs.
Not that I care much, anyway, as HiP is basically CUDA behind the thinnest of veneers. AMD at least had a case when they were in the OpenCL camp, but now they've completely given away the API game to Nvidia.
I also wonder how much longer GPUs are going to be relevant for deep learning. AMD should snap up somebody like Tenstorrent or Cerebras, while it still has the chance.
flashmozzg - Monday, November 8, 2021 - link
Did you miss the ROCm 5.0 announcement?

mode_13h - Tuesday, November 9, 2021 - link
I saw that it was announced, but even now I can't find any statement about it supporting consumer GPUs. If you know differently, please provide a link.

Also, last I checked, AMD was the only one not yet on OpenCL 3.0. I hope it addresses that, too.
xhris4747 - Tuesday, November 9, 2021 - link
ROCm officially supports AMD GPUs that use the following chips:

GFX8 GPUs
“Fiji” chips, such as on the AMD Radeon R9 Fury X and Radeon Instinct MI8
“Polaris 10” chips, such as on the AMD Radeon RX 580 and Radeon Instinct MI6
“Polaris 11” chips, such as on the AMD Radeon RX 570 and Radeon Pro WX 4100
“Polaris 12” chips, such as on the AMD Radeon RX 550 and Radeon RX 540
GFX9 GPUs
“Vega 10” chips, such as on the AMD Radeon RX Vega 64 and Radeon Instinct MI25
“Vega 7nm” chips, such as on the Radeon Instinct MI50, Radeon Instinct MI60 or AMD Radeon VII
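For what it's worth, you can see what a given machine actually exposes to the HIP runtime with a few lines of device-query code; whether a device is officially supported is still down to AMD's documentation, not this output. A minimal sketch, assuming a working ROCm/HIP install:

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    if (hipGetDeviceCount(&count) != hipSuccess || count == 0) {
        std::printf("No HIP-capable devices visible.\n");
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        hipDeviceProp_t prop{};
        hipGetDeviceProperties(&prop, i);    // marketing name, CU count, memory size, ...
        std::printf("Device %d: %s, %d CUs, %.1f GB VRAM\n",
                    i, prop.name, prop.multiProcessorCount,
                    prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}
```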
mode_13h - Wednesday, November 10, 2021 - link
But still no RDNA or RDNA2? That's a problem.

Spunjji - Wednesday, November 10, 2021 - link
They split the compute and gaming architectures after RDNA, so I'm not sure how much use it would be to support that when it's diverging from the CDNA architectures they're selling into servers.

mode_13h - Wednesday, November 10, 2021 - link
Tell Nvidia that. They did basically the same thing, as far as bifurcating their product lines, but I guess they should've dropped CUDA on their gaming cards?

You sound just as clueless as AMD.
mode_13h - Wednesday, November 10, 2021 - link
A big thing that Nvidia got right about CUDA was supporting it across virtually their entire product stack, from the small Jetson Nano platform up through nearly all of their consumer GPUs, and obviously including their workstation & datacenter products.

Students could then buy a GPU to use for coursework, whether deep learning, computer vision, or perhaps even scientific computing, and get double duty out of it as a gaming card. These same students are creating the next generation of apps, libraries, and frameworks. So, it's very short-term focused for AMD to prioritize HIP and HPC projects above support for their compute stack on all their gaming GPUs.
Helmery - Monday, November 8, 2021 - link
"to not just recapture lost glory, but to rise higher than the company ever has before"... and they did it!Rise, Zen, Ryzen... Epyc Lisa Su and Lisa team!
hd-2 - Monday, November 8, 2021 - link
Tesla's D1 (on N7) claimed 362 TFLOPS of BF16 and 22.6 TFLOPS of FP32 (vector?) at 400W, with 1.1 ExaFLOPS of FP16/CFP8 across 3000 D1 chips (354 nodes each). Still hard to compare, since some of the other metrics aren't listed for both, and there was never a deep dive into D1 here at AT afaik.

blanarahul - Monday, November 8, 2021 - link
In the consumer space, GA102 already had 28.3 billion transistors @ 628 sq mm. I don't expect Nvidia to create a dual-die graphics card just yet. Makes me wonder how big the RTX 4090 will be, and whether it will use EFB/InFO-L attached HBM memory.

mmschira - Monday, November 8, 2021 - link
The compute acceleration space is getting really interesting, but the fragmentation of this space is a real headache and I can't see anybody trying to fight this.

Nvidia requires CUDA,
Apple requires Metal
AMD?
Intel?
What are their programming interfaces?
Kevin G - Tuesday, November 9, 2021 - link
Intel has oneAPI as a proprietary solution that spans their product portfolio (including some dedicated AI accelerators and FPGA products).

AMD was a big proponent of OpenCL back in the day. Intel and Nvidia also support OpenCL as an interoperable solution, though they favor their own proprietary solutions.
mode_13h - Tuesday, November 9, 2021 - link
OpenCL was intended to unify compute on GPUs and other devices, but then everybody retreated to their own proprietary solutions. MS has DirectCompute, Google has RenderScript, Apple (co-inventor of OpenCL) has Metal, Nvidia has CUDA, and AMD basically adopted CUDA (but they call it "HiP").

That just leaves Intel pushing OpenCL, but as the foundation of their oneAPI.
Some people are advocating Vulkan Compute, but it turns out there are two different things people mean when they talk about that. Some people mean using Vulkan graphics shaders to do compute, which is sort of like how early GPU-compute pioneers used Direct3D and OpenGL shaders before CUDA and OpenCL came onto the scene (yes, there were other niche GPGPU languages in the interim). However, there are actual Vulkan/SPIR-V compute extensions that aren't widely supported. The main problems with using Vulkan graphics kernels to do computation are their relaxed precision and the lack of support by pure compute devices, FPGAs, DSPs, etc.
sirmo - Tuesday, November 9, 2021 - link
AMD uses HIP, which is basically AMD's CUDA layer. They even have tools which "HIPify" a CUDA codebase to work on both CUDA and HIP transparently.

LightningNZ - Tuesday, November 9, 2021 - link
I think that's very smart of them too. CUDA is almost ubiquitous already, and the more effort ($) Nvidia puts into CUDA marketing/support/development, the more benefit AMD automatically gains. Going it alone right now would be insane.

mode_13h - Wednesday, November 10, 2021 - link
I disagree. It puts AMD in the position of having to copy everything Nvidia does. And if Nvidia figures out something that's hard for AMD to duplicate on their hardware, or that Nvidia can patent, then AMD gets into trouble. It's a treacherous path with not much upside for AMD, and it makes them look like a cheap knockoff of Nvidia, which only increases Nvidia's perceived status as "the premium solution".

I get why AMD is doing it. They want to give CUDA users an option. If their sales team approaches a customer and the customer says "we have all this CUDA software", then HiP keeps that from abruptly ending the conversation.
However, for people who avoid CUDA for fear of getting locked into Nvidia, HiP isn't much of an answer. The only real options for steering clear of vendor lock-in are OpenCL and Vulkan Compute. So, I'm disappointed to see that AMD seems to have diverted resources and attention to HiP, away from their OpenCL support efforts.
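For readers who haven't seen the two side by side, here is a minimal HIP sketch (illustrative, not AMD sample code). Apart from the hip-prefixed runtime calls, it is line for line what the CUDA version would be, which is what the "thin veneer" comments above are getting at.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

// Same __global__ / blockIdx / threadIdx vocabulary as CUDA.
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *a, *b, *c;
    hipMalloc(reinterpret_cast<void**>(&a), bytes);   // cudaMalloc -> hipMalloc
    hipMalloc(reinterpret_cast<void**>(&b), bytes);
    hipMalloc(reinterpret_cast<void**>(&c), bytes);

    // Host data transfer omitted for brevity; the launch syntax is the familiar triple chevron.
    vec_add<<<(n + 255) / 256, 256>>>(a, b, c, n);
    hipDeviceSynchronize();

    hipFree(a); hipFree(b); hipFree(c);
    return 0;
}
```

The hipify tools mentioned above essentially perform that cuda* -> hip* renaming mechanically, which is both the convenience and, per the lock-in concern, the limitation.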
Tomatotech - Wednesday, November 10, 2021 - link
“these links run at the same 25Gbps speed when going off-chip to other MI250s or an IF-equipped EPYC CPU. Besides the big benefit of coherency support when using IF, this is also 58% more bandwidth than what PCIe 4.0 would otherwise be capable of offering.”

Looks like AMD is targeting this for 1:1 speed with PCIe 5.0 which will be out relatively soon.
(Note: I know little about this area of computation, so if this comment makes no sense to an expert, enlighten me please. In my defence, I'm just picking up on the comparison to PCIe 4.0 in the quote.
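Rough arithmetic behind that comparison, for anyone checking the quote. The 16-lane width per IF link is an assumption chosen to reproduce the article's 58% figure, so treat this as a sanity check rather than a spec:

```cpp
#include <cstdio>

int main() {
    // Assumption: one Infinity Fabric link = 16 lanes at 25 Gbps, per direction.
    const double if_link_gbs = 25.0 * 16 / 8.0;                      // 50 GB/s per direction

    // PCIe 4.0 x16: 16 GT/s * 16 lanes with 128b/130b encoding.
    const double pcie4_x16 = 16.0 * 16 * (128.0 / 130.0) / 8.0;      // ~31.5 GB/s per direction
    const double pcie5_x16 = 2.0 * pcie4_x16;                        // ~63.0 GB/s per direction

    std::printf("IF link %.1f GB/s vs PCIe 4.0 x16 %.1f GB/s -> %.0f%% more\n",
                if_link_gbs, pcie4_x16, (if_link_gbs / pcie4_x16 - 1.0) * 100.0);
    std::printf("PCIe 5.0 x16 would be ~%.1f GB/s per direction\n", pcie5_x16);
    return 0;
}
```

Under that assumption a single IF link lands at about 50 GB/s per direction, i.e. roughly 59% over PCIe 4.0 x16 (matching the quoted 58%), though a bit short of a PCIe 5.0 x16 link rather than exactly 1:1.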
Tomatotech - Wednesday, November 10, 2021 - link
)

mode_13h - Wednesday, November 10, 2021 - link
> By baking their Infinity Fabric links into both their server CPUs and their server GPUs,
> AMD now has the ability to offer a coherent memory space between its CPUs and GPUs
This was Vega's big feature; remember all the noise they made about the high-bandwidth cache controller (HBCC)? That supposedly enabled cache coherency across PCIe.
https://www.anandtech.com/show/11002/the-amd-vega-...
> AMD hasn’t named it – or at least, isn’t sharing that name with us
The internal code name is Aldebaran. It's another big star, in keeping with the succession of Polaris, Vega, and Arcturus.
> Graphics Compute Die
The irony of this name is that the die doesn't actually have any graphics hardware. It's really just a compute die, with a video decode block in there somewhere.
That's right: it *cannot* play Crysis!
> sets the MI250 apart is that it only comes with 6 IF links, and it lacks coherency support.
So, you just program it like 2 independent GPUs that happen to share the same package?
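That appears to be the model: the two dies enumerate as separate devices, and without coherency any data they share has to be moved explicitly (or accessed over peer mappings, where available). A hedged HIP sketch of what that looks like; the device numbering and whether peer access is actually exposed are assumptions about the platform:

```cpp
#include <hip/hip_runtime.h>

int main() {
    const size_t bytes = 64u << 20;
    float *buf0 = nullptr, *buf1 = nullptr;

    hipSetDevice(0);                                        // die 0
    hipMalloc(reinterpret_cast<void**>(&buf0), bytes);
    hipSetDevice(1);                                        // die 1
    hipMalloc(reinterpret_cast<void**>(&buf1), bytes);

    // Can device 0 map device 1's memory directly (e.g. over the in-package links)?
    int peer_ok = 0;
    hipDeviceCanAccessPeer(&peer_ok, 0, 1);
    if (peer_ok) {
        hipSetDevice(0);
        hipDeviceEnablePeerAccess(1, 0);    // direct access, but not cache-coherent sharing
    }

    // Bulk data still moves between the dies as an explicit copy, not a shared pool.
    hipMemcpyPeer(buf0, 0, buf1, 1, bytes);

    hipSetDevice(1); hipFree(buf1);
    hipSetDevice(0); hipFree(buf0);
    return 0;
}
```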
> 560W TDP
Wow!
> OAM ... spec maxing out at 700W for a single card.
Wow!!
> Being a standardized interface, OAM also offers potential interoperability ...
Uh, but AMD is using its proprietary infinity fabric protocol over those links, right? So, I take it the interoperability is limited to having a generic motherboard that can accommodate accelerators of different brands, but not mixing & matching accelerators?
> Many of AMD’s regular partners slated to offer MI200 accelerators in their systems
I expect customer uptake to be relatively low. I hope I'm wrong about that, but AMD has made the strategic error of always targeting the last Nvidia launch as the mark to beat. Nvidia will certainly leapfrog them yet again. And, by this point, AMD has probably missed most of the market window for deep learning on GPUs.
mode_13h - Wednesday, November 10, 2021 - link
Maybe AMD is worried Aldebaran will be too easily confused with Arcturus and (worse) Ampere?

I don't see a real upside to them publicizing the code name, but I do like the astronomical naming scheme. At least it scales well. By the time they run out of names, each machine will probably need a small star to power it!
timt - Wednesday, November 10, 2021 - link
Are you sure there are 112 CUs? The die photo above has 102 distinct units (14 * 8). I don't see where they might be hiding 10 more.

michec - Tuesday, November 23, 2021 - link
14 x 8 = 112, not 102