24 Comments
eSyr - Monday, November 8, 2021 - link
“we are slowly moving into an era where how we package the small pieces of silicon together is just as important as the silicon itself”—/me gives this sentence a stern IBM MCM look.
E. Gadsby - Monday, November 8, 2021 - link
Uh huh, because IBM has done 3D die stacking in Z right? Or did you comment without reading the article? BTW IBM has been doing MCM forever, as have others. It’s like bragging about copper interconnect at this point in time.
Samus - Tuesday, November 9, 2021 - link
IBM/Motorola were first to copper interconnects. Where do you think AMD got it from? :)
eSyr - Monday, November 8, 2021 - link
“768 MiB of L3 cache, unrivalled by anything else in the industry”—/me gives this sentence a stern IBM z15 SC chip look.
kfishy - Monday, November 8, 2021 - link
That’s an eDRAM L4 cache; the AMD announcement is about an SRAM-based L3 cache.
saratoga4 - Monday, November 8, 2021 - link
It's not a die stack, but their eDRAM did actually stack the DRAM capacitor vertically below the transistors, so conceptually it's a similar idea: increase density by putting cache into the third dimension. I was always disappointed that no one else adopted the idea.
Rudde - Monday, November 8, 2021 - link
How few cores could you have enabled, and still use all the 96 MB of L3 cache on a CCD?
nandnandnand - Monday, November 8, 2021 - link
The answer is probably 1, as long as the software can make use of it.
EPYC 72F3 has 8 chiplets with 1 core enabled on each. It would be hilarious if they made a 3D V-Cache version of that.
Kevin G - Monday, November 8, 2021 - link
96 MB of L3 cache per core would be interesting as that might be enough cache to keep a few processes/libraries fully cached while context switching between them. Schedulers would be highly incentivized to keep processes pinned as much as possible to specific cores due to how warm those caches would be.
shing3232 - Monday, November 8, 2021 - link
It doesn't matter. You can use all 96 MB regardless.
webdoctors - Monday, November 8, 2021 - link
Wow, soon the new meme will be: can it cache Crysis?
nandnandnand - Monday, November 8, 2021 - link
We're gonna need an L4.
don0301 - Tuesday, November 9, 2021 - link
You, Sir, just won the Internet today :)
nandnandnand - Monday, November 8, 2021 - link
804 MiB = 768 MiB L3 cache + 32 MiB L2 cache + 4 MiB L1 cache
Wilco1 - Monday, November 8, 2021 - link
This requires 8 + 8 + 1 + 8 dies in total (8 of which are spacers). Wouldn't it be simpler and cheaper to use larger 120mm^2 chiplets or have a single L4 SRAM die on top of the IO die? N7 yields are more than good enough.
nandnandnand - Monday, November 8, 2021 - link
Smaller dies = better yields. And if they do put an L4 cache on top of the I/O die it ought to be at least a few gigabytes.
Kevin G - Monday, November 8, 2021 - link
There is a case for the L4 cache on the IO die to act exclusively in the domain of a memory channel. This would act as a large reorder buffer for the memory controller to optimize read/write turnarounds and do some prefetching based purely on local memory controller access patterns, which would include requests from outside the local socket domain. Even a small L4 cache can show a decent gain depending on the system architecture and workload. IBM did something like this for their POWER chips. And it should be noted that desktop workloads would actually be a poor fit for this.
In mobile there is an argument for an L4 cache to act as the system level cache for SoC blocks that don't normally have a large dedicated cache to themselves, while the CPU/GPU blocks evolve to include their own private L3 caches.
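To make the reorder-buffer idea concrete, here is a toy C sketch (not AMD's or IBM's actual scheduler; the request format and the reads-before-writes, row-sorted policy are assumptions purely for illustration) of how buffering a window of requests and regrouping them cuts read/write bus turnarounds:

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { int is_write; unsigned row; } Req;

    /* Reads before writes, then sorted by DRAM row to maximize row hits. */
    static int cmp(const void *a, const void *b) {
        const Req *x = a, *y = b;
        if (x->is_write != y->is_write) return x->is_write - y->is_write;
        return (x->row > y->row) - (x->row < y->row);
    }

    static int turnarounds(const Req *q, int n) {
        int t = 0;
        for (int i = 1; i < n; i++)
            if (q[i].is_write != q[i - 1].is_write) t++;  /* read<->write bus switch */
        return t;
    }

    int main(void) {
        /* An interleaved request stream as it might arrive at the controller. */
        Req q[] = { {0,7}, {1,3}, {0,3}, {1,7}, {0,9}, {1,9}, {0,3}, {1,5} };
        int n = (int)(sizeof q / sizeof q[0]);
        printf("turnarounds as issued : %d\n", turnarounds(q, n));
        qsort(q, n, sizeof q[0], cmp);  /* reorder within the buffered window */
        printf("turnarounds reordered : %d\n", turnarounds(q, n));
        return 0;
    }

A real controller does this in hardware over a much smaller window; a large memory-side L4 mainly gives it a bigger window of traffic to regroup, which is the gain being described above.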
Wilco1 - Tuesday, November 9, 2021 - link
The IO die should fit at least 768MB of L4. Besides allowing DRAM optimization as Kevin mentions, all of it could be used by a single core if needed, allowing applications with a ~800MB working set to run completely from SRAM.
Note that yields on 7nm are good, and yields on SRAM dies are pretty much 100% irrespective of their size.
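For a rough sense of why die size barely matters for yield here, a minimal Poisson defect model, yield = exp(-area x defect density), already puts dies this small in the 90s; the 0.09/cm^2 defect density and the die areas below are assumptions for illustration, not TSMC figures, and on top of this SRAM arrays carry row/column redundancy so most defects can be repaired rather than killing the die:

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double d0 = 0.09;                          /* assumed defects per cm^2 on a mature N7 */
        double area_mm2[] = { 36.0, 81.0, 120.0 }; /* ~cache die, ~Zen 3 CCD, proposed bigger chiplet */
        for (int i = 0; i < 3; i++) {
            double a_cm2 = area_mm2[i] / 100.0;
            printf("%5.0f mm^2 die -> raw yield ~ %.1f%%\n",
                   area_mm2[i], 100.0 * exp(-a_cm2 * d0));
        }
        return 0;
    }

Even the 120 mm^2 case only drops into the high 80s before any redundancy repair, which is why "N7 yields are more than good enough" is a fair summary.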
nandnandnand - Tuesday, November 9, 2021 - link
Just use DRAM/HBM for L4.
Wrs - Monday, November 8, 2021 - link
They’d have to redesign the chiplets for 120mm^2 - assuming that’s 16 cores with no L3. That would almost surely include changes to the ring bus, as latency scales with stops on a ring. I’m already curious about L3 latency/bandwidth in the Milan-X parts. If they’ve made so few alterations to the Zen 3 CCD, I’d begin to suspect they tripled L3 but left bandwidth unchanged. Notice the lack of bandwidth stats in the latest marketing slides. TSVs can do so much more, if only they’d rework their interconnect.
On the L4-on-IOD idea, the existing latency between each CCD and the IOD is already significant. That’s not to say they can’t have both massive L3s and a massive L4, but each saps the bonus of the other and the end result might not be economical.
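On actually measuring that L3 latency: the standard approach is a dependent-load (pointer-chasing) microbenchmark swept across working-set sizes, so the latency step shows where each cache level ends. A minimal sketch follows; the buffer sizes, iteration count and use of rand() are illustrative choices, and it should be pinned to a single core (e.g. with taskset) when run:

    #include <stddef.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void) {
        for (size_t mb = 1; mb <= 256; mb *= 2) {
            size_t n = mb * (1u << 20) / sizeof(size_t);
            size_t *buf = malloc(n * sizeof *buf);
            if (!buf) return 1;
            for (size_t i = 0; i < n; i++) buf[i] = i;
            /* Sattolo's algorithm: a single cycle, so the chase visits every slot. */
            for (size_t i = n - 1; i > 0; i--) {
                size_t j = (size_t)rand() % i;
                size_t t = buf[i]; buf[i] = buf[j]; buf[j] = t;
            }
            struct timespec t0, t1;
            size_t p = 0;
            const size_t iters = 20u * 1000 * 1000;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (size_t i = 0; i < iters; i++) p = buf[p];  /* dependent loads, nothing overlaps */
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
            printf("%4zu MiB footprint: %5.1f ns per load (sink=%zu)\n", mb, ns / iters, p);
            free(buf);
        }
        return 0;
    }

On a standard Zen 3 part the curve steps up once the footprint spills past 32 MiB of L3; where the step lands on a 96 MB V-Cache CCD, and at what latency, is exactly the open question here.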
LightningNZ - Monday, November 8, 2021 - link
Exactly. You've also got a process mismatch which may make TSVs more difficult to match across dies, and the IO die runs at a lower clock speed which also lessens any cache advantage. It would have to be a huge cache for it to offset the additional latency and still provide a win over DRAM.
E. Gadsby - Monday, November 8, 2021 - link
https://www.anandtech.com/show/16725/amd-demonstra...
Bandwidth was 2TBps for the original stacking announcement. Would be surprising if that were also not the case here.
Wilco1 - Tuesday, November 9, 2021 - link
I meant design one 8-core chiplet with 96MB of cache and then create 2 sets of masks, one for the full cache and one with the extra cache cut off. Or do a single chiplet with 64MB L3 and use it both for desktop and servers. This should have lower latency than adding an SRAM die on top.
I can't help but feel that adding a tiny SRAM die on top of a tiny chiplet is overkill. It's a great technology, but it seems like a solution looking for a problem...
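For scale on the bigger-chiplet option, the usual dies-per-wafer approximation, DPW = pi*(d/2)^2/A - pi*d/sqrt(2*A), gives a feel for what the extra area alone costs on a 300 mm wafer; the die areas are rough figures and this ignores the mask-set and validation costs raised in the reply below:

    #include <math.h>
    #include <stdio.h>

    static double dies_per_wafer(double d_mm, double area_mm2) {
        const double pi = 3.14159265358979;
        /* Gross wafer area divided by die area, minus an edge-loss correction term. */
        return pi * (d_mm / 2) * (d_mm / 2) / area_mm2
             - pi * d_mm / sqrt(2.0 * area_mm2);
    }

    int main(void) {
        printf("~%.0f candidate dies per 300 mm wafer at  81 mm^2\n", dies_per_wafer(300.0, 81.0));
        printf("~%.0f candidate dies per 300 mm wafer at 120 mm^2\n", dies_per_wafer(300.0, 120.0));
        return 0;
    }

That is roughly a third fewer candidate dies per wafer before yield is even considered; whether that beats paying for stacking is then mostly the mask-cost and volume question the thread turns to next.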
E. Gadsby - Wednesday, November 10, 2021 - link
I think it gives them optionality. A separate mask set is likely far more expensive, and this is unlikely to be a high volume product even on desktop Threadripper. The SRAM chiplet is cheaper to make. The main CCD is already high volume, so that cost is amortized, and the volume of this stacked part can be adjusted as per demand. Those who need the big cache will pay the extra cost. Risk to the company is lower.