There hasn't been a lot of talk about when this will apply to Intel consumer parts, other than CPU+GPU in one package. AMD clearly spent a fair amount of time paying attention to how different components within a computer talk to each other in order to streamline the overall system. As a result, we saw AMD linking Infinity Fabric speed to memory speed, and the hope is that with third-generation Ryzen chips this link will go away and IF will just run at a high speed no matter what RAM is used in the system. That raises the question of whether that advancement would apply only to systems with a 500-series chipset, or whether the 300- and 400-series chipsets would also benefit from the change.
I don't believe this is following AMD's path at all, and AMD did not invent MCM technology. People like to give AMD more credit than they deserve.
I'm more curious about the connection between EMIB and Foveros. The Package Technology Roadmap diagram shows Foveros as the next evolution, or possibly revolution, of EMIB, but the Agilex diagram makes EMIB look like a connector.
It could be that Foveros uses stacked layers and uses EMIB to connect components, so Kaby G is effectively a single layer of Foveros. I think the importance of Kaby G with EMIB is really overlooked - it links different manufacturing processes, even from different vendors (Intel and AMD), and it is an important testing ground for future Intel plans.
Personally I think it's crazy that we have such huge chips today, and going 3D is the next revolution in packaging technology. I can also see AMD one day licensing Foveros, or developing their own version, to reduce the size of their chips.
Maybe not following AMD directly, but AMD is sure ahead of the game, with MCM CPUs shipping for over a year now. Intel was quick to point out the flaws in MCM, but now they are going to do it too. AMD paving the way, as usual :)
With EMIB in Kaby G, I would say that Intel is ahead of the game. Can AMD connect non-AMD silicon on the same package (excluding the Intel Kaby G processor) like in last year's Dell XPS 15 2-in-1?
One Intel chip that already uses this technology is the Kaby G processor, which was released in Q1 2018.
But let's stop looking at the past; this is about the future, and everything in Intel's future is likely going to EMIB/Foveros, including the new graphics chips.
It will be interesting to see what the next-generation Dell XPS 15 2-in-1 uses. My guess is an Intel CPU with Gen11 graphics. Performance-wise, I expect it to beat Kaby G on the graphics side.
AMD does have experience with interposers through HBM, which yes is different from EMIB, but when it comes down to it, both companies have experience placing chips embedded on the same package (which is what they're doing with these other technologies, just in different manners). AMD has experience integrating other companies' IP into their chips as well. It really just comes down to syncing the necessary communication bits so the chips can properly talk to each other (not to make light of that - it's actually a pretty difficult issue to get sorted out at the transistor level). Which I think is the main driver behind AMD going to a separate I/O chip: it lets them customize that for different customers as the need fits, while maximizing economies of scale for their base CPU/GPU components.
I think we're seeing both companies just starting the transition to chiplet designs, so it's hard to say where things will go from here. Even just for the short term, there is a lot left to see about what each can offer (CPU and GPU performance, for instance).
I agree. AMD is taking an incremental approach and Intel is going more revolutionary with EMIB. I think both camps will end up at pretty much the same destination by 2022. I tend to think strategically AMD made the better decisions. Take Intel's 10nm where they were pushing the densities and it bit them hard. This is why it was mentioned Intel is designing for the process they have.
We'll just have to agree to disagree on this - I'm not sure they are heading in the same direction. AMD is getting bigger and Intel is getting smaller, especially with Foveros. But Intel has learned from its mistakes with 10nm and is coming back hard with new fabs and the Sunny Cove architecture. Just remember that the nm number does not matter; what matters is how much technology can be put in the same space, and how fast it runs.
" But Intel has learn from it mistakes with 10 nm and coming back hard with new Favs and Sunny Cove architexture. ( architecture ) " yea right.. unless you are able to see into the future HStewart, this is a false statement... until intel proves it... face it.. just like with the Athon64, intel was caught with its pants down... the sleeping giant, as you once put it.. finally woken up.. and is now playing catch up. once again.. some one puts amd up.. and you come along.. put them down.. but try to prop intel up...
As for who Intel licenses EMIB from: Invensas. They own the patent on it (https://patents.google.com/patent/US20180108612A1/...), and Intel is one of their clients. You don't need to be Sherlock Holmes to figure it out. Just because Intel says 'patented EMIB construction' doesn't mean that *they* own the patent.
"Harmonizing the Industry around Heterogeneous Computing" ► http://www.hsafoundation.com/
Hard to say where Chipzilla is going with this. Are they moving tech forward, or elbowing out competitors with their market heft? TSVs/interposers with HBM have been around since the Radeon Fury. The Kaveri APU (I think) introduced the "original HSA arch" with GPU cores sniffing CPU L2 cache via 'Radeon Control Links'.
Fast-forward 4 years, and I/O plus CPU cores have begat chiplets joined by 'Freedom Fabrics' on Socket SP3/TR4 LGA 4094, & DDR4 with four selectable unique bank groups. In another year ... ?
On-die HBM last-level cache on IF linking Ryzen & ARM chiplets sniffing/flushing L2 with an I/O chipper, and IOMMU, via TSVs and LGA 5128, DDR5 with eight selectable unique addressable bank groups, and hybrid RISC-CISC ISA?
The truth is out there, man, and comes down to cache coherency and a specialized, unique address space. HSA lives, man.
AMD for Life __ Intel for the Wife !
(I'm thinkin' you guys don't remember Thunder Man?)
Just calling Intel 'Chipzilla' gives your statements no credibility at all. This technology from Intel is the future, and I would say that one day it will either be replicated (and AMD fans will say that AMD created it) with HSA or whatever, or AMD will be inclined to use it.
The truth of the matter is that AMD is a clone of Intel's architecture (please don't go into 64-bit - that is just an evolution of the original x86 design, and it would have happened eventually).
Maybe Intel is for the wife, which is not a bad thing, but it takes a wife to create kids, and that is where AMD's main user base is. In life one must grow up from one's childless ways. I would never say AMD is for life; in fact, basing your life on a computer chip is just wrong.
And Zen is not going to change this. Do you really think Intel is a sleeping giant? It's certainly not an extinct dinosaur. When Sunny Cove comes out, AMD fans will be crying monopoly and other BS.
Hmm, the butthurt is strong with this one. And saying "don't go into 64 bit"; of course you don't want anyone saying it. It's one of the areas where AMD beats Intel, so a very sore spot for you. Christ, take a look at your post again and realize that it's you that has no life, being a cheerleader for a company that doesn't know you exist.
HStewart, umm, Zen already changed this. Look at what Intel has HAD to release since Zen first came out: now, all of a sudden, Intel's mainstream CPUs have MORE than 4 cores, and their HEDT chips have more cores too. And as sa666666 said about "don't go into 64-bit": why not? Because AMD brought it to the mainstream, NOT Intel? Or how about the on-die memory controller? Yep, AMD, NOT Intel. If AMD hadn't brought 64-bit or the on-die memory controller to the mainstream, how long would it have taken Intel to do it? Considering how long Intel kept quad cores in the mainstream, even now we might still be stuck with 32-bit and the chipset handling the memory controller.
HStewart, reading the posts above this one, as well as the ones below, really does make you out to be an Intel fanboy. Someone mentions something good about AMD, you come in right after and bash them, and then prop up Intel.
"People like to give AMD more credit than they deserve." Actually, AMD deserves all the credit people are giving them. IMO, AMD has done more for the CPU since the Athlon 64 came out than Intel has in the same time frame.
You haven't even seen it yet and you are already declaring it the future.
You shit on AMD for bringing the first performant MCM design to market (there were many before, but none good), yet praise Intel for something whose performance numbers you don't have the slightest idea about.
Just stop, we all know what camp you are in. For everyone else I'm going to wait to see the benchmarks before I declare it anything.
As time goes on, that statement becomes more and more embarrassing. Maybe the non-technical people should shut up and step aside to avoid future embarrassment, and let the people who have a clue do the decision-making and the speaking.
Now that people realize it's a LOT harder to fabricate chips closer to the physical limits of atom spacing, everyone will be looking at chiplet tech as a way to reduce cost. The article outlines some of the steps that need to be taken: chiplets need to be validated before being attached to each other, and bandwidth, speed, and power requirements all have to be addressed. It doesn't happen overnight, but AMD was smart and started working on it many years ago, and it's proven that even with today's fabrication techniques the chiplet design is viable - mostly with many-core server/workstation chips. But this year we will see Navi. The difficulty of adapting the GPU to chiplet designs is much higher; it will likely take many more transistors for the same performance, but the cost of those transistors will be lower.
Unfortunately, while AMD pioneered HBM, it seems Nvidia was the first to really profit from it. In AMD's case, I wonder if it didn't just slow down the launch of Fury until Nvidia had something to launch against it ...that ended up being fast enough on plain GDDR5, no less.
At least with EPYC, AMD enjoyed a good run before Intel joined the multi-die party.
Um, I think they are kind of still enjoying that. If the program can use the cores, aren't AMD's multi-core CPUs (Ryzen, Threadripper and EPYC) all either on par with, faster than, or close enough to Intel that there is now an option if one doesn't want to pay for overpriced Intel CPUs?
Chiplets provide a lot more than simple cost savings. With chiplets you can bin each one individually, like AMD does with Threadripper, to get very fast, power-efficient, high-core-count chips (a toy sketch below illustrates the idea). With monolithic chips you can't mix and match the best dies; you take the whole thing or leave it.
In addition, chiplets allow you to modularize the CPU. What AMD is doing with the I/O die lets them build bigger chips with a ton of cache and a ton of cores, all with equal access to that cache.
At some point you should also see an active interposer, which, according to a study done by the University of Toronto, can offer superior core-to-core latency if done correctly. That makes sense, of course: an active interposer would intelligently route data using the shortest path. That's not something monolithic CPUs can do, as they do not have an interposer directly connecting all the CPU resources.
With a monolithic CPU, you can actively route data through the CPU die. You don't need an active interposer. I'm not saying chiplets are bad - just that one point.
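To illustrate the binning point a few comments up, here is a toy Monte Carlo sketch. Every number in it is invented for illustration; it only shows why selecting the best dies from a large population tends to beat a monolithic part that is limited by its slowest quadrant.

```python
import random

# Toy model of chiplet binning: pick the best dies out of a large population,
# versus a monolithic part limited by its worst on-die quadrant.
random.seed(0)

def die_fmax():
    return random.gauss(4.3, 0.15)   # achievable clock of one small die, GHz (assumed)

population = [die_fmax() for _ in range(1000)]

# Chiplet product: bin the four best dies from the population.
binned_part = min(sorted(population, reverse=True)[:4])

# Monolithic product: runs at the speed of its slowest of four on-die quadrants.
monolithic_part = min(die_fmax() for _ in range(4))

print(f"Binned 4-chiplet part: {binned_part:.2f} GHz")
print(f"Monolithic part:       {monolithic_part:.2f} GHz")
```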
People always misunderstood this comment, since it always was a tad bit sarcastic.
Intel actually did the "Glue" thing first, back in the olden days. Back then, AMD made such Glue jokes about it. Then AMD adopts it a decade or so later, and Intel throws the joke back at them.
But because people have short memory, or could barely read the first time, the history is lost, and the cycle just repeats itself from the other side.
If you recall, AMD had the first dual-core chip, so Intel glued two CPUs together to make the Pentium Extreme Edition 840, which sold for $1000. In that case Intel really did take two existing CPUs and glue them together with zero thought. In my mind, AMD's approach with Zen was fairly elegant and first to market, but Intel's new EMIB approach is even better. Who knows - by the time there are desktop parts using EMIB, we could see AMD's approach evolve even further.
The difference being people do not comment about your statement 10 years ago being wrong. It's not that people have a short memory, it's that it would be stupid to lambast someone for something said so far back. If you are holding statements for 10 years to eventually laugh in someone's face I'd imagine you'd have zero friends.
Intel is catching flak because within a year they announced they are now using glue. Time is the most important piece of context.
I'm pretty sure the only one that looked stupid was Intel, for making such a stupid comment and following it up with nice little charts as well. So if you want to make someone look stupid, point that towards Intel for making such a dumb statement back then. Of course people will keep rehashing it; it was a point in time people will remember, when Intel was so butthurt and had very little to say, so they lashed out with the 'glued together' statement. It will stick with them for a very long time - get used to it.
I would not think of Intel's packaging as glue - think of it as logically combining chip functionality from different processes.
As for gluing chips onto motherboards: technology is getting smaller and smaller, and people try to do things with technology that it was not originally designed for, and then complain about a faulty chip when they do. This is even more important for mobile devices, now that people want to take their phones into pools and such.
They dropped out of 5G modems, their GPU attempts in the past were unsuccessful, their 10nm process is in shambles, their 14nm shortage is ridiculous, and their Cascade Lake is a joke at 350-400W.
So people who believe that Intel's first try at chiplets, going straight to 3D stacking, will be successful are mistaken. The 10nm process proved that too many variables can be a real nightmare, even for Intel. 3D stacking might be even worse than that.
This is PR to steal AMD's thunder; that's all I see.
I guess you shook your magic 8-ball and know this for a fact?
Also, if it's 10% across the board AND comes with an even better reduction in power than early figures suggest, either way this points to AMD making up ample ground in 2019 and heading into 2020, just like in '18 and '17.
The only "modern" Intel parts, IMO, are the 9000 series, and very few of those were impressive beyond the pricing. AMD really has them in lockstep: Intel has massive costs and walks a tight line that AMD does not have to worry about nearly as much. AMD is on a roll, actually walking what they talk, while Intel has been pretty much all talk; they should just release, see what folks say, then do the BIOS, drivers, etc. As it stands, with Intel using the same to drastically more power than AMD - meh, I will take a 10% loss in performance for a good chunk less money and lower power usage (when actually using it) any day.
Third generation Ryzen will see a 13-15 percent IPC boost, with the 7nm fab process allowing around a 20 percent boost to clock speeds. You put those together, and that's a big jump in one generation.
Intel is counting on 10nm to help, but that only allows for the potential of higher clock speeds, and IPC hasn't really improved, so claiming that Intel will magically get better based on statements alone... wasn't 10nm on track back in 2015? Yeah, you can really trust what Intel says when it comes to release schedules or performance.
Taking into account Zen 2's guesstimated IPC increase, the switch to the 7nm node, and the associated power-efficiency gains and higher clocks expected from the new node, Zen 2 should be at least 25% faster than the original Zen and 20% faster than Zen+ (unless AMD screwed up the design and TSMC screwed up the high-performance variant of their 7nm node). "At least" as in "worst case scenario". More realistically, though still rather conservatively, I expect a 25% gain over Zen+ and a 30% gain over the original Zen.
If the high end can hit a 5.0GHz overclock on all cores, and first generation Ryzen was hitting 4.0GHz on all cores, that's a 25 percent boost in clock speed alone. The 13-15 percent IPC figure was over Zen+, which brought a 3-5 percent IPC boost over first generation Ryzen with it.
So clock speeds alone if that 5.0GHz speed is correct would give the 25 percent boost, not even taking the IPC improvements into account.
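For what it's worth, the compounding works out like this. This is toy arithmetic using the rumored figures quoted above (13-15% IPC over Zen+ and a 4.0 GHz to 5.0 GHz all-core jump), not confirmed specs:

```python
# Toy arithmetic only: IPC and clock figures are the rumored numbers from the
# comments above, not confirmed specs.
clock_gain = 5.0 / 4.0 - 1          # 25% if the 5.0 GHz all-core rumor holds
for ipc_gain in (0.13, 0.15):
    combined = (1 + ipc_gain) * (1 + clock_gain) - 1
    print(f"IPC +{ipc_gain:.0%} and clock +{clock_gain:.0%} -> ~{combined:.0%} per core")
# Prints roughly 41% and 44% - the gains multiply rather than add.
```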
An engineering sample of Zen 2 did tie the 9900K. Given that the original Zen improved clocks by 23% from engineering sample to retail, I expect Zen 2 to do well.
AMD stated that third-generation Ryzen will offer on-par performance with the 9900K for gamers. This will be the first time in a long while that AMD has had a gaming CPU tied for the lead. I would take this claim seriously, based on their Radeon VII statements being pretty accurate. Power requirements and pricing will be interesting, though.
That was also with a demo unit that only had a single CCX. If AMD can pull off a 5.0GHz clock speed for all cores, and if AMD sets the price for 8 cores at the $330-$350 range, Intel is in for a rough time for the rest of the year. Yes, there are several, "if"s in there, because we can't know the exact specs until we get Computex leaks, and even then, there will be some doubt about the final performance numbers until review sites get a chance to test the release versions, but I expect that AMD can hit 8 [email protected] at the very least.
As 7 nm matures, AMD can always move desktop chips back to a single-die. That's how I expect they'll answer Sunny Cove. In the meantime, they're getting out a 7 nm chip sooner than it might otherwise be viable.
Don't get your hopes up for Sunny Cove. Intel themselves said their first iteration of 10nm won't be better than 14nm++.
It's going to be a minor bump in IPC and perhaps more cores. Anything more simply isn't in the cards. Yields are also going to be pretty bad for the first year.
This is a repeat of the Athlon days. AMD caught Intel with their proverbial pants down and was able to take advantage of it. Intel eventually got its act together and came back with the Core 2 products, knocking AMD back down. I expect the same thing will happen again, since Intel has many more resources and much more money than AMD. It may take longer this time though, as AMD has a good product in Ryzen.
Intel is still playing games with retailers to keep AMD products out of product displays though. With the Intel CPU shortage, you can't honestly think that retailers are putting all-Intel displays out there for products that aren't there in large numbers without there being something improper going on.
This time around, AMD is taking all the growth opportunities that Intel is unable to block. This is the first time that AMD can finally pierce the mobile and server market and that Intel cannot do anything about it because they are unable to supply the OEM. Well, they have it coming. There is no turning back at this point.
In 5 years, AMD could end up with 30-35% of the datacenter market. If they can manage similar figures in laptop, that would be disturbing.
As for gaming, well, it is over, all games are going to be developed on an AMD chip except Switch.
The big spoiler for AMD's plans would be ARM. ARM is gunning for both the laptop and server market, in a pretty big way. Still, AMD is such a small company that even 10% - 15% of the datacenter market would be huge, for them.
It's more like PR for investors. If they had actual products coming for sure, they would mention them. I think Intel has a strong future, but these things aren't going to be exciting to me personally for a long time.
Now this could entirely be a coincidence, but the timing is a bit curious - first you have the "Intel exits 5G smartphone modem market" news (as a side note), followed shortly after by "look what great technology Intel has planned"
It's just a report based on info provided by the company, and while it's parsed differently and has extra comments/analysis, it's hard to really write a critical piece on a company's far-future plans. Most of this stuff is in VERY early development.
I'm sorry to hear you think that. Nonetheless, that's good feedback for us to hear, as that's how we improve things.
So as a bit of background, Intel reached out to us to have a chat about future packaging/interconnect technologies. There was no specific presentation, just a chance to ask about the current state of affairs.
Generally speaking these are some of the cooler articles we get to work on, both because we get a chance to talk to a division of a company we don't usually get to talk to, and because we get to focus a bit on the future as opposed to benchmarking yet another chip right now.
But if you think it sounds like a paid advertisement, that's good feedback to have. Is there anything you'd like to have seen done differently?
This is actually a great article with great info. I think they were expecting something impossible, based on the emotional responses they have to companies' failures and successes. If you catered to that crowd, the articles would seem bloated with pointless bias.
I actually like when you talk to people "down in the trenches". So, kudos on that. I thought there was some useful detail in there. It definitely helped me understand and appreciate the differences between interposer and EMIB. It's also interesting to know that they haven't been looking into cooling for Foveros, since that's clearly one of the challenges in using it to increase compute density. And it's not too surprising, but I didn't know that EMIB assembly and packaging basically needed to happen at Intel - even for outside chiplets.
I think one point you're dancing around is that multi-die GPUs depend heavily on software innovations to improve data locality and load balancing. This should benefit even monolithic GPUs, as well.
Perhaps @YB1064 could cite which elements and aspects make it sound like an ad.
Some people appear to be missing the bigger questions this raises.
GPU chiplets? As noted in the article, [multiple] GPU chiplets suck even more than CPU chiplets. While you might want a single mask to provide all your GPUs, it is unlikely to be worth the cost of the latency. What this means is that Intel believes that even if they ever get 10nm to work, yield will suck. Or it could just be Raj's pet project that will keep Intel GPUs in the "fallback when your graphics card dies" level.
Pretty sure this isn't Intel copying AMD (again). Much of this work has been done by Apple, Samsung and the big phone players, although they don't have the power/heat issues that Intel will wrestle with (ignoring heat with 3d stacking? Yeah, that's a smart move. Please ignore the elephant in the room, and pretend it is a spherical cow).
This all looks great from a server perspective, but I'm not sure it has any effect on the laptop/desktop world I'm familiar with. Perhaps eliminating the power/heat costs will help shore up their laptop line vs. the Ryzen 3000 onslaught, but all of this looks like they really don't care about that world (unless you really believe in multiple GPU chiplets).
While the "ARM invasion" of the server world has been more talk and the occasional pile of money burned to the ground, Intel knows that the ARM world has most of these issues solved (although less so for server-level heat generation). I'd suspect that this is making sure they are ready to compete with Ampere if they ever get a competitive CPU going.
- Note that my disdain for GPU chiplets wouldn't apply to Nvidia making a Tesla compute "chiplet" that is ~700 mm^2 (about as big as you can make a chip, and the size of current HPC Nvidia "GPUs"). Since they already use an interposer (for the HBM), it might not be much of a stretch to glue a few together (building a much larger single chip would suffer the same issues and might not have any yield at all).
Ignoring the heat issues proves that Intel is just trying to attract clueless investors.
These heat issues will factor heavily into how the silicon is put together. So many physical variables are in play here, and Intel just dismisses all of them. Guess what: stacking GPU chiplets is going to work really well for them if they are using liquid nitrogen... /sarcasm
It's hardly a bad idea, though. The GPU has always been the parallelism specialist. Continuing to scale transistor counts past the Moore's Law boundary is the most likely future for the GPU.
Yes, it's a bit like a large, multi-die GPU. NVLink is cache coherent, so it can certainly be used in that way. It's up to software to make it scale efficiently.
I think you're off the mark on your GPU latency concerns. Power and bottlenecks will be the main issue with multi-die GPUs.
GPU architectures are already quite good at managing latency, which is why they can cope with higher-latency GDDR type memories. But shuffling around lots of data between dies could be a killer. So, what's really needed is software innovations - to improve locality and load balancing between the chiplets. Without that, a multi-die GPU would just be a slow, hot mess.
BTW, Nvidia has already spun their NVLink-connected multi-GPU server platforms as "the world's biggest GPU". Treating them as a single GPU indeed could give them a lead on the software front needed to enable efficient scaling with chiplet-based GPUs.
"Ramune did state that Intel has not specifically looked into advanced cooling methods for Foveros type chips, but did expect work in this field over the coming years, either internally or externally."
"When discussing products in the future, one critical comment did arise from our conversation. This might have been something we missed back at Intel’s Architecture Day in December last year, but it was reiterated that Intel will be bringing both EMIB and Foveros into its designs for future graphics technologies. "
Intel is going to shoot themselves in the foot.
Wang on the other hand has something different to say in a post-Koduri led RTG, where he added: "To some extent you're talking about doing CrossFire on a single package. The challenge is that unless we make it invisible to the ISVs [independent software vendors] you're going to see the same sort of reluctance. We're going down that path on the CPU side, and I think on the GPU we're always looking at new ideas. But the GPU has unique constraints with this type of NUMA [non-uniform memory access] architecture, and how you combine features... The multithreaded CPU is a bit easier to scale the workload. The NUMA is part of the OS support so it's much easier to handle this multi-die thing relative to the graphics type of workload".
Lol, this is actually completely backwards. GPUs have significantly lower communication requirements than the typical server CPU. In fact, GPUs are specifically designed with effectively fully siloed blocks on chip. SMs, for instance, don't have a lot of communication going between them. You could rather easily develop a multi-chip GPU with a central cache/memory die connecting to multiple pipeline modules. The SMs really only need to talk to the caches and memory. Latency overhead is pretty minimal as well; after all, the whole design of GPUs is to tolerate the massive latency required for texturing.
I wouldn't be so dismissive of AMD's point. ISVs don't like multi-GPU, and a chiplet strategy looks a heck of a lot like that.
And you really can't put all your memory controllers in one chiplet - that would be a huge bottleneck and wouldn't scale at all. GDDR already uses narrow channels, so it naturally distributes to the various chiplets.
The way to do it is to distribute work and data together, so each chiplet is mostly accessing data from its local memory pool. Using Foveros, you might even stack the chiplet on this DRAM.
Not really. A multi-die strategy only looks like a multi-gpu strategy if the individual chiplets are exposed.
And everyone already puts all their memory controllers in 1 die, fyi. You want them all in 1 die to reduce cross talk between dies. The workload is such that distributing the memory results in massive communication overheads.
Hiding the fact that cores are on different dies doesn't make bottlenecks go away - it just hamstrings the ability of software to work around them.
And when you say "everyone already puts all their memory controllers in 1 die", who exactly is "everyone"? So far, the only such example is Ryzen 3000, which is a CPU - not a GPU. GPUs require an order of magnitude more memory bandwidth. Furthermore, this amount must scale with respect to the number of chiplets, which doesn't happen if you have one memory controller die.
Who's hiding cores? No one cares where the shader blocks are as long as they are fed.
And who is everyone? Nvidia and Radeon. We're talking GPUs here, FYI. One memory-controller die is fine; it can rather easily connect to 512-1024 GB/s of bandwidth, because existing GPUs already do that.
You said "A multi-die strategy only looks like a multi-gpu strategy if the individual chiplets are exposed", which means you aim to obscure the fact that GPU cores aren't all on the same die. That's what I meant by "Hiding the fact that cores are on different dies".
Anyway, I guess you're right and the industry is wrong for going NUMA. They are all idiots and you are the true genius. You should not waste your precious time and talent on here. Instead, go to Nvidia and tell them that they are on the wrong track. Then, hit up AMD and inform them that Wang (and probably also Lisa Su) doesn't really know how to design GPUs and you will step in and do it for them.
Um, by your standard of hiding cores, every GPU on the market hides cores. Programmers just generate a workflow; they don't need to know how many shaders per SM or how many SMs per die.
NUMA is and always has been a last resort. Having designed multiple NUMA systems that have shipped for commercial revenue from multiple different vendors and covering 3 different ISAs, I'm fairly confident in saying that. NUMA is never a goal and it is always something you try to minimize as much as possible. Instead it is a compromise you make to work within the given constraints.
For GPUs with chiplets, the constraints that would require a NUMA design largely don't exist. And a Rome-style design is actually the better path for the high-end, enthusiast, and enterprise/HPC markets. It allows you to push the technology edge for the actual compute components as fast as possible and keeps costs down, while letting you use a trailing-edge process for the memory/interface component, and should, given large-die effects on frequency/power, allow a lower-power and overall more efficient design. This is largely enabled for GPUs by the rather limited (to non-existent) shader-to-shader communication requirements (which is, after all, the primary computational-model advantage of GPUs over CPUs).
Let me repeat that somewhat for emphasis: CPUs have both an expectation and practice of computation-to-computation communication, while GPUs do not and instead have an expectation and practice of communication generally only between computational phases (a complex way of saying that GPUs are optimized to read from memory, compute in silos, and write back to memory).
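A minimal sketch of that "read memory, compute in silos, write back" pattern, with NumPy standing in for shader hardware. This is purely illustrative; the block split and the per-pixel function are made up:

```python
import numpy as np

# Each "shader block" gets its own slice of pixels and never reads another
# block's intermediate results - compute in silos, then write back.

def shade(pixels):
    # Independent per-pixel work: a made-up gamma-style transform.
    return np.clip(pixels ** 0.8 * 1.1, 0.0, 1.0)

frame = np.random.rand(1024, 1024).astype(np.float32)   # "read from memory"
blocks = np.array_split(frame, 8)                        # eight independent "shader blocks"
result = np.vstack([shade(b) for b in blocks])           # no block-to-block traffic
assert result.shape == frame.shape                       # "write back to memory"
```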
> they don't need to know how many shaders per SM or how many SMs per die.
Not if they're assuming a SMP memory model. Again, that's a problem, to the extent they control where things execute and where memory is allocated. I know OpenGL doesn't expose those details, but I know far less about Vulkan and DX12.
> GPUs ... have an expectation and practice communication generally only between computational phases
You're living in the past. Also, you underestimate how much communication there is between the stages of a modern graphics pipeline (which, contrary to how it may sound, execute concurrently for different portions of the scene). Current GPUs are converging on cache-based, coherent memory subsystems.
Yeah, I've read those. They don't support your argument nor do the actual programming models used support your argument. The second you try to treat GPU memory as dynamically coherent, they slow to a complete crawl.
You missed the point, which is that whether you like it or not, GPUs are adopting a cache-based memory hierarchy much like that of CPUs.
Of course, they *also* have specialized, on-chip scratch pad memory (the Nvidia doc I referenced calls it "shared memory", while I think the operative AMD term is LDS). I don't expect that to disappear, but there's been a distinct shift towards becoming increasingly cache-dependent.
You're also ignoring the fact that GPUs are increasing dependence on the cache hierarchy, increasing cache sizes, and increasing coherence support. AMD's Vega has coherent L2 and optionally coherent L1. From the Vega Whitepaper (linked above):
"“Vega” 10 is the first AMD graphics processor built using the Infinity Fabric interconnect that also underpins our "Zen" microprocessors. This low-latency, SoC-style interconnect provides coherent communication between on-chip logic blocks"
Also, as I cited above, the L1 cache in Volta was promoted to be much more CPU-like:
"Integration within the shared memory block ensures the Volta GV100 L1 cache has much lower latency and higher bandwidth than the L1 caches in past NVIDIA GPUs. The L1 In Volta functions as a high-throughput conduit for streaming data while simultaneously providing high-bandwidth and low-latency access to frequently reused data—the best of both worlds."
So, to disregard changes in GPUs' cache organization and utilization so dismissively seems baldly disingenuous. I take it as a sign that you've nothing better to offer.
> CPUs have both an expectation and practice of computation-to-computation communication while GPUs do not and instead have an expectation and practice of communication generally only between computational phases (a complex way of saying that GPUs are optimized to read from memory, compute in silos, and write back to memory).
The funny thing about this is that it actually argues *for* NUMA in GPUs and *against* NUMA in CPUs. So, thanks for that.
But it's also an overstatement, in the sense that CPU code which does too much global communication will scale poorly, and GPUs (and graphics shader languages) have long supported a full contingent of atomics, as graphics code stays a bit less in silos than you seem to think.
Um, then you don't understand it. Every shader block basically needs access to every memory block to maintain performance. Doing that with NUMA introduces massive bottlenecks.
You start using a lot of atomics in GPU code, it comes crashing to a halt pretty quickly.
You clearly know nothing about graphics programming. You can't design an architecture to solve a problem you don't understand.
Those who introduced atomics in shaders clearly knew what they were doing and the corresponding performance impacts, but sometimes there's just no way around it. It's also laughable to suggest that there's no locality of reference, in graphics. In particular, to say that right after claiming that GPUs "compute in silos", is quite rich.
You really have to make the memory shared between all chiplets. That is, less like the theoretical Nvidia MCM (the picture in this article) and more like EPYC Rome. Rome has a central I/O hub that accesses memory; Nvidia outlined a NUMA architecture.
AMD's approach is probably too slow for gaming purposes, but it could turn out to be a nice compute card.
Shouldn't be too slow. You are adding a handful of nanoseconds of latency to what is at best a 100+ ns latency path, and upwards of thousands of ns. In fact, the asymmetrical design has a long history in graphics hardware going back decades. You only start to run into issues with symmetrical designs, where every chip has to communicate with every other chip due to distributed memory.
This whole "too slow for gaming purposes" idea is just laughable. You are realistically dealing with loaded memory latencies well over 100 ns in a graphics chip running a gaming workload. The die crossing is going to add single-digit latencies to the path overall. It's not going to have any effect on something that is measured in milliseconds, especially since the whole graphics shader pipeline is heavily designed from inception to be latency-insensitive by necessity.
And honestly, in both the GPU and server-CPU cases, Rome's design makes much more architectural sense than the previous EPYC designs, which were primarily constrained to allow 100% reuse of the consumer part in the server market.
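A quick back-of-the-envelope check of that latency argument. The figures are the rough numbers quoted in this thread, not measurements:

```python
# Rough proportions only: assumed loaded DRAM latency vs an added die crossing.
loaded_mem_latency_ns = 300    # assumed loaded DRAM latency under a gaming workload
die_crossing_ns = 5            # the "single digit" nanoseconds added by a die crossing
frame_budget_ns = 16.7e6       # one frame at ~60 fps

print(f"Added latency per memory access: ~{die_crossing_ns / loaded_mem_latency_ns:.1%}")
print(f"One crossing vs the frame budget: 1 part in {frame_budget_ns / die_crossing_ns:,.0f}")
```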
Bottlenecks aren't much of an issue given an EMIB interconnect. Power will increase slightly, but that could largely be made up by not having to deal with large-die effects.
Focusing on EMIB misses the point. The point is you have a situation where all GPU cores from all dies are equidistant from all memory controllers. You're forcing the memory controller die to have a giant, multi-TB/sec crossbar. If you read any coverage of NVSwitch, you'd know it already burns a significant amount of power, and what you're talking about is even a multiple of that level of bandwidth.
Granted, a good amount of that is to drive the NVLink signals further distances than required for in-package signalling, but a non-trivial amount must be for the switching, itself.
To scale *efficiently*, the way to go is NUMA hardware + NUMA-aware software. This is the trend we've been seeing in the industry, over the past 2 decades. And it applies *inside* a MCM as well as outside the package, particularly when you're talking about GPU-scale bandwidth numbers. The only reason 7 nm EPYC can get away with it is because its total bandwidth requirements are far lower.
GPUs already have giant multi-TB/sec crossbars. How do you think they already connect the shader blocks to the memory controller blocks?
NVSwitch is not even close to equivalent to what we are describing here. We're talking about an on-package, optimized interconnect, simply moving what would have been a massive crossbar in a monolithic GPU into a separate die connecting to a bunch of smaller dies that are effectively shader blocks, just like an Nvidia SM. You are grossly mixing technologies here. NVLink is a meter-plus external FR4 interconnect; it has basically nothing in common with an on-package interconnect.
NVSwitch power should be almost all link based. I've actually designed switch/routers that have shipped in silicon in millions of devices, they don't actually take up that much power.
And no, NUMA doesn't make much sense for a GPU system. It significantly increases the overheads for minimal to no benefit. NUMA isn't a goal, it is a problem. Always has been always will be. If you don't have to go down that path, and you don't for GPUs, then you don't do it. We do it for CPUs because we need to scale coherently across multiple packages, which is not and has never been a GPU issue.
> GPUs already have giant multi-TB/sec crossbars. How do you think they already connect the shader blocks to the memory controller blocks?
There are obviously different ways to connect memory controllers to compute resources, such as via ring and mesh topologies. In Anandtech's Tonga review, they seem to call out Tahiti's use of crossbar as somewhat exceptional and costly:
"it’s notable since at an architectural level Tahiti had to use a memory crossbar between the ROPs and memory bus due to their mismatched size (each block of 4 ROPs wants to be paired with a 32bit memory channel). The crossbar on Tahiti exposes the cards to more memory bandwidth, but it also introduces some inefficiencies of its own that make the subject a tradeoff."
...
"The narrower memory bus means that AMD was able to drop a pair of memory controllers and the memory crossbar"
> NUMA isn't a goal, it is a problem. ... If you don't have to go down that path, ... then you don't do it.
No debate there. Nobody wants it for its own sake - people do it because it scales. Crossbars do not. To feed a 4-chiplet GPU, you're probably going to need 4 stacks of HBM2, which would seem to mean 32 channels (assuming 128 bits per channel). So we're probably talking about something like a 32x32-port crossbar. Or take GDDR6, which uses 16-bit channels, meaning a 512-bit-wide memory data bus would also have 32 channels. Nvidia's TU102 uses a 384-bit bus, so I figure a proper multi-die GPU architecture should be capable of surpassing that.
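Spelling out the channel arithmetic behind that crossbar sizing, using the figures quoted above (HBM2 exposes 8 x 128-bit channels per stack):

```python
# Channel counts only; switch port count scales with the number of channels.
hbm2_stacks = 4
hbm2_channels_per_stack = 8   # 8 x 128-bit channels per HBM2 stack
print("HBM2 channels to switch:", hbm2_stacks * hbm2_channels_per_stack)    # 32

gddr6_bus_bits = 512
gddr6_channel_bits = 16       # GDDR6 uses 16-bit channels
print("GDDR6 channels to switch:", gddr6_bus_bits // gddr6_channel_bits)    # 32
```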
One issue is that you're assuming GPUs don't already have some form of NUMA. Even CPUs, like Intel's previous-generation HCC Xeon dies and their current-gen server dies, have memory controllers distributed throughout their topology in a way that limits connectivity and increases latency non-uniformly. Why wouldn't they have just used a crossbar, if crossbars are as cheap as you claim?
But the real win NUMA can offer is to keep non-shared memory accesses local. So, you don't burn power shuffling around more data than necessary, you don't risk excessive bank conflicts, and (bonus) you don't need enormous, hot switches with enough bandwidth to handle all of your memory traffic.
The only reason not to consider NUMA for what's essentially a supercomputer in a package is what Wang cites, which is that ISVs have gotten used to treating graphics cores as SMP nodes. So, you've got to either bring them along, or devise a NUMA architecture that behaves enough like SMP that their existing codebase doesn't take a performance hit.
Because their traffic patterns are fundamentally different than GPUs. GPUs don't have a lot to any cross communication between computational resources. Shader Block0 doesn't really care what Shader Block1 is doing, what memory it is updating, etc. Communication is basically minimal to the point that they generally have a separate special system to allow that communication in the rare cases it is required so that the requirement doesn't impact the primary path. In contrast, CPU's computational resources are in constant communication with each other. They are always maintaining global coherence throughout the entire memory and cache stack.
GPUs basically don't have non-shared memory access per se. That's what presents all the complexities with multi-GPU setups. Shader block 0 is just as likely to need to access memory block A or block B as shader block 1 is. For CPUs, there are plenty of workloads that, as long as they maintain coherence, don't have a lot of overlap or competition for given memory regions (and designs like AMD's original EPYC/Ryzen design do best on these effectively shared-nothing workloads).
GPUs fundamentally need UMA like access for graphics workloads.
> Shader Block0 doesn't really care what Shader Block1 is doing, what memory it is updating, etc.
That's precisely the case for NUMA, in GPUs.
> GPUs basically don't have non-shared memory access per se. That's what presents all the complexities with multi-gpu setups.
Actually, the complexities with multi-GPU setups are largely due to the lack of cache coherency between GPUs and the orders of magnitude longer latencies and lower communication bandwidth than what we're talking about. It's no accident that NVLink and AMD's Infinity Fabric both introduce cache coherency.
> GPUs fundamentally need UMA like access for graphics workloads.
The term of art for which you're reaching would be SMP. You want all compute elements to have a symmetric view of memory. Unified Memory Architecture typically refers to sharing of a common memory pool between the CPU and GPU.
Go ahead and write a program that relies on inter-shader coherence and see how that works for you...
No, it isn't the case for NUMA, because they all want access to all of memory; they just want read access. Make it NUMA and you'll be link-limited in no time.
SMP is symmetrical multi processor. It is a term of art that means that the computational units are the same. It does not actually describe memory access granularity and complexity of the system. NUMA/UMA are the correct terms for referring to memory access granularity and locality.
> Go ahead an write a program that relies on inter shader coherence and see how that works for you...
The same is true for multi-core CPUs. At least GPUs' deeper SMT implementation can keep shader resources from sitting idle, as long as there's enough other work to be done.
If current GPU shader programming languages had something akin to OpenCL's explicit memory hierarchy and work group structure, it would be easier for the runtime to allocate shared resources and schedule their usage closer to each other. That would be a big enabler for GPUs to go NUMA. That said, you can't eliminate global communication - you just need to reduce it to within what the topology can handle.
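As a rough sketch of how far tiling can push locality under that "reduce, don't eliminate" framing. Both numbers are invented for illustration; a real renderer's access pattern is far messier:

```python
# Toy locality estimate: if tiles are pinned to the chiplet whose memory pool
# holds them, only samples near a tile edge cross dies.
tile_width = 512   # pixels per chiplet tile (hypothetical)
halo = 2           # neighbouring pixels a shader might sample across the edge
remote_fraction = 2 * halo / tile_width
print(f"Accesses landing in a neighbouring chiplet's pool: ~{remote_fraction:.1%}")
```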
> No it isn't the case for NUMA, because they all want access to all of memory, they just want read access.
I don't know why you think CPUs are so different. Most software doesn't use core affinity, and OS schedulers have long been poor at keeping threads and memory resources together. Efficient scaling only happens when software and hardware adapt in a collaborative fashion. OpenCL was clearly a recognition of this fact.
Also, GPUs already sort of have the problem you describe, with ever narrower memory channels. The 16-bit channels of GDDR6 could easily get over-saturated with requests. The GPU shader cores just see this as more latency, which they're already well-equipped to hide.
> SMP is symmetrical multi processor. It is a term of art that means that the computational units are the same. It does not actually describe memory access granularity and complexity of the system.
Centralized memory is fundamental to the definition of SMP.
"Symmetric multiprocessing (SMP) involves a multiprocessor computer hardware and software architecture where two or more identical processors are connected to a single, shared main memory, have full access to all input and output devices, ..."
It remains to be seen whether AMD can pull this off.
They have danced around unified memory and cache coherency. Bringing it all together in a heterogeneous arch is the holy grail. The good news is AMD has been building for this moment for almost ten years, from the fusion APUs with SIMD/AVX 'graphic engines' to chiplet designs with independent I/O. This. Is. Hard. Stuff. We've come quite a-ways from the fusion 'onion and garlic' days.
Bank collisions, cache thrashing, etc, are the bad, old days of NUMA and page faults. Hopefully, Rome (and improved IOMMU) will incrementally move the ball forward toward a heterogeneous unified memory between cores/chiplets/bank groups, etc. Otherwise, we're stuck in a brute-force mindset.
The Rome approach won't scale. GPUs have far higher memory bandwidth requirements than CPUs. Nvidia's not stupid. They went with an EPYC (gen 1)-like architecture for good reasons.
Let me summarize. Intel: "Look at what we could possibly make in the future, so that you don't consider adopting 7nm Rome/Ryzen/Navi in two months.....Never mind that we will not have anything in hand until late 2020.....Just look at the monkey!"
Lol, it's Intel again. Intel failed at 5G because it couldn't meet deadlines - what's different between 10nm and 5G? We will find out this year whether Intel can really be trusted.
So now Intel is going to glue chips together? Funny thing, using the same line of BS you aimed at AMD as your new product strategy... Intel: "We are going to use chiplets..." AMD: "Hold my beer."
People are too interested in jumping straight into AMD vs Intel to look at the big picture here: WHY is Intel pushing this tech?
Why would you design using chiplets rather than a single SoC? Under "normal" conditions, chiplets are going to cost more (packaging is never free) and draw more power (going off-chip, even via EMIB or an interposer, is always more expensive than staying on-chip).
Some reasons that spring to mind:
- integrating very different technologies. This is, e.g., the HBM issue: DRAM fabs are different from logic fabs.
- integrating IP from different companies. This is the Kaby-Lake G issue.
- you can't make a die the size you want. This is the AMD issue. It's not THAT interesting technically in that that's just a fancy PCB (much more expensive, much more performant); and the target market is much smaller. If the economics of yield rates vs interposer change, AMD shifts how they do multi-module vs larger die.
- OR the last reason, a variant of the first. To mix and match the same technology but at different nodes.
NOW let's ask the Intel question again. Why are they doing this? Well the only product they've told us about as using Foveros is Lakefield. Lakefield is a ~140mm^2 footprint. For reference, that's the same size as an A12X shipping on TSMC7nm. Why use Foveros for this thing? It's NOT that large. It's NOT using fancy memory integration. (There is packaged DRAM, but just basic PoP like phones have been doing for years.) It's not integrating foreign IP. What it IS doing is creating some of the product on 10nm, and some on 22nm.
So why do that? Intel will tell you that it saves costs, or some other BS. I don't buy that at all. Here's what I think is going on. It's basically another admission that 10nm has, uh, problematic, yields. The lousy yields mean Intel is building the bare minimum they have to on 10nm, which means a few small cores. What CAN be moved off chip (IO, PHYs, stuff like that) HAS been moved off, to shrink the 10nm die.
So I see this all as just one more part of Intel's ever-more-desperate EGOB, a desperate juggling of balls. Foveros is introduced as a great step forward, something that can do things current design can not. That's not actually true, not for a long while, but Intel has to keep extending the hype horizon to stay looking viable and leading edge. Meanwhile they ALSO have to ship something, anything, in 10nm, to keep that ball in the air. So marry the two! Use a basically inappropriate match (a design that would not make sense for Apple or QC, because it's more expensive and higher power than just making a properly designed SoC of the appropriate size) to allow shipping Lakefield in acceptable volumes, while at minimal die size! And as a bonus you can claim that you're doing something so awesome with new tech!
OK, so the Intel fans will all disagree; this is crazy. OK, sure, but ask yourselves:
- Do you REALLY believe that it's cheaper to make two separate dies and package them together than to make a larger die? That goes against the economics of semiconductor logic that have held since the 70s. It ONLY works if the cost of 10nm is astonishingly high compared to 22nm (which is another way of saying terrible yield).
- If this is not being done because of 10nm yield, then why is it being done? To save 100 mm^2 of area? Give me a break! One square centimeter is essential on a watch. It's nice (but inessential) on a phone. It's irrelevant on a laptop. If you're that starved for area in a laptop, even a really compact one, there are simpler options, like the sandwiched PCB used by the iPhone X and now by (I think) the newest Samsung phones.
Fair point. I hadn't realized Lakefield was so small.
Of course, I'd argue that Kaby-G didn't make a heck of a lot of sense, either. Maybe it's just Intel trying to sell Foveros and drive their customized processor business.
Why? Simple. Because it's still far cheaper than trying to make huge monolithic dies on increasingly smaller process nodes.
You need to understand that certain defects, like edge roughness, don't scale. The smaller the lines you want to create, the bigger (in relation) such defects get. So as nodes get smaller, they get increasingly hard to make defect-free, irrespective of the size of the die.
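A rough Poisson yield model makes that cost argument concrete. The defect density and die areas below are invented placeholders, not real process data:

```python
import math

# Simple Poisson yield: the chance a die has zero killer defects.
def die_yield(area_mm2, defects_per_mm2):
    return math.exp(-area_mm2 * defects_per_mm2)

d0 = 0.002               # defects per mm^2 on a hypothetical leading-edge node
monolithic_mm2 = 600     # one big die
chiplet_mm2 = 75         # one of eight chiplets providing the same total area

y_mono = die_yield(monolithic_mm2, d0)   # ~30%
y_chip = die_yield(chiplet_mm2, d0)      # ~86%

# Silicon area spent per good product, ignoring packaging and test overhead.
cost_ratio = (8 * chiplet_mm2 / y_chip) / (monolithic_mm2 / y_mono)
print(f"Monolithic yield {y_mono:.0%}, chiplet yield {y_chip:.0%}, "
      f"chiplet silicon cost ~{cost_ratio:.0%} of monolithic")
```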
Haawser - Wednesday, April 17, 2019 - link
Intel didn't invent EMIB either. They license it, AFAIA: https://patents.google.com/patent/US20180108612A1/... Not sure if it's an exclusive license though.
HStewart - Monday, April 22, 2019 - link
Where is the information that Intel licenses EMIB? But the next revolution is Foveros.
Haawser - Thursday, April 25, 2019 - link
From Invensas. They own the patent on it, and Intel is one of their clients... You don't need to be Sherlock Holmes to figure it out. Just because Intel says 'patented EMIB construction' doesn't mean that *they* own the patent.
Smell This - Wednesday, April 17, 2019 - link
"Harmonizing the Industry around Heterogeneous Computing"► http://www.hsafoundation.com/
Hard to say where Chipzilla is going with this. Are they moving tech forward, or elbowing out competitors with their market heft? TSVs/interposers with HBM have been around since the Radeon Fury. The Kaveri APU (I think) introduced the "original HSA arch" with GPU cores sniffing CPU L2 cache via 'Radeon Control Links'.
Fast-forward 4 years, and I/O plus CPU cores have begat chiplets joined by 'Freedom Fabrics' on Socket SP3/TR4 LGA 4094, and DDR4 with four selectable unique bank groups. In another year ... ?
On-die HBM last-level cache on IF linking Ryzen & ARM chiplets sniffing/flushing L2 with an I/O chipper, and IOMMU, via TSVs and LGA 5128, DDR5 with eight selectable unique addressable bank groups, and hybrid RISC-CISC ISA?
The truth is out there, man, and comes down to cache coherency and a specialized, unique address space. HSA lives, man.
AMD for Life __ Intel for the Wife !
(I'm thinkin' you guys don't remember Thunder Man?)
HStewart - Wednesday, April 17, 2019 - link
Just calling Intel 'Chipzilla' gives your statements no credit at all. This technology from Intel is the future, and I would say that one day it will either be replicated (and AMD fans will claim AMD created it) with HSA or whatever, or AMD will be inclined to use it. The truth of the matter is that AMD is a clone of Intel's architecture (and please don't bring up 64-bit; that is just an evolution of the original x86 design and would have happened eventually).
Maybe Intel is for the wife, which is not a bad thing, but it takes a wife to create kids, and that is where AMD's main user base is. In life one must grow up from one's childish ways. I would never say AMD is for life; in fact, basing your life on a computer chip is just wrong.
And Zen is not going to change this. Do you really think Intel is a sleeping giant? It is certainly not an extinct dinosaur. When Sunny Cove comes out, AMD fans will be crying monopoly and other BS.
sa666666 - Wednesday, April 17, 2019 - link
Hmm, the butthurt is strong with this one. And saying "don't go into 64 bit"; of course you don't want anyone saying it. It's one of the areas where AMD beats Intel, so a very sore spot for you. Christ, take a look at your post again and realize that it's you that has no life, being a cheerleader for a company that doesn't know you exist.
Korguz - Wednesday, April 17, 2019 - link
HStewart, um, Zen already changed this. Look at what Intel has HAD to release since Zen first came out: now, all of a sudden, Intel's mainstream CPUs have more than 4 cores, and their HEDT chips have more cores too. And as sa666666 said, why "don't go into 64-bit"? Because AMD brought it to the mainstream, not Intel? Or how about the on-die memory controller? Yep, AMD, not Intel. If AMD hadn't brought 64-bit or the on-die memory controller to the mainstream, how long would it have taken Intel to do it? Considering how long Intel kept quad cores in the mainstream, we might still be stuck on 32-bit with the chipset handling the memory controller. HStewart, reading the posts above this one, as well as the ones below, really makes you out to be an Intel fanboy: someone mentions something good about AMD, and you come in right after, bash them, and then prop up Intel...
" People like to give AMD more credit than they deserved. " actually.. amd deserves all the credit people are giving them... IMO.. AMD has done more for the cpu, since the Athlon 64 came out. then intel has in the same time frame...
mode_13h - Saturday, April 20, 2019 - link
In fact, AMD deserves even more credit when you actually compare their size with Intel's.
evernessince - Sunday, April 21, 2019 - link
You haven't even seen it yet and you are already declaring it the future. You shit on AMD for bringing the first performant MCM design to market (there were many before, but none good), yet you praise Intel for something you don't have the slightest idea of the performance of.
Just stop; we all know what camp you are in. As for everyone else: I'm going to wait to see the benchmarks before I declare it anything.
azrael- - Thursday, April 18, 2019 - link
But... but... but... it's all held together with IntelliGlue, so it's MUCH MUCH gooder than AMD.
(The above statement might contain traces of sarcasm)
Xyler94 - Thursday, April 18, 2019 - link
You broke my sarcasm meter... do you have sarcasm insurance?
Avengingsasquatch - Friday, April 19, 2019 - link
Ya, and AMD has announced it will follow Intel's path with chip stacking... nice.
Karmena - Wednesday, April 17, 2019 - link
Oh, Intel GLUE vs. AMD GLUE. But Intel told us that gluing chips together is not good.
willis936 - Wednesday, April 17, 2019 - link
As time goes on, that statement becomes more and more embarrassing. Maybe the non-technical people should shut up and step down to avoid future embarrassment, and let the people who have a clue do the decision making and speaking.
Opencg - Wednesday, April 17, 2019 - link
Now that people realize it's a lot harder to fabricate chips closer to the physical limits of atom spacing, everyone will be looking at chiplet tech as a way to reduce cost. The article outlines some of the steps that need to be taken: chiplets need to be validated before being attached to each other, and bandwidth, speed, and power requirements all have to be addressed. It doesn't happen overnight, but AMD was smart and started working on it many years ago, and it's proven that even with today's fabrication techniques the chiplet design is viable, mostly with many-core server/workstation chips. But this year we will see Navi, and the difficulty of adapting the GPU to chiplet designs is much higher. It will likely take many more transistors for the same performance, but the cost of those transistors will be lower.
mode_13h - Saturday, April 20, 2019 - link
Unfortunately, while AMD pioneered HBM, it seems Nvidia was the first to really profit from it. In AMD's case, I wonder if it didn't just slow down the launch of Fury until Nvidia had something to launch against it... which ended up being fast enough on plain GDDR5, no less. At least with EPYC, AMD enjoyed a good run before Intel joined the multi-die party.
Korguz - Sunday, April 21, 2019 - link
Um, I think they are kind of still enjoying that... if the program can use the cores, aren't AMD's multi-core CPUs (Ryzen, Threadripper, and EPYC) all either on par with, faster than, or close enough to Intel that there is now an option if one doesn't want to pay for overpriced Intel CPUs?
evernessince - Sunday, April 21, 2019 - link
Chiplets provide a lot more than simple cost savings. With chiplets you can bin each one individually, like AMD does with Threadripper, to get very fast, power-efficient, high-core-count chips. With monolithic chips you can't mix and match the best dies; you take the whole thing or leave it (a toy sketch of this binning advantage follows below).
In addition, chiplets allow you to modularize the CPU. What AMD is doing with the I/O die allows them to build bigger chips with a ton of cache and a ton of cores, all with equal access to that cache.
At some point you should also see an active interposer, which, according to a study done by the University of Toronto, can have superior core-to-core latency if done correctly. That makes sense, of course: an active interposer can intelligently route data along the shortest path. That's not something monolithic CPUs can do, as they do not have an interposer directly connecting all the CPU resources.
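As a toy illustration of the binning point (all frequencies below are made-up numbers, not real bins): with chiplets you can sort tested dies and combine only the best ones into a top SKU, whereas a monolithic part is stuck with whatever its worst on-die block can do.

```python
import random

random.seed(1)
# Assumed per-die maximum stable clocks (GHz) after test; purely illustrative.
chiplet_bins = [round(random.uniform(3.6, 4.4), 2) for _ in range(32)]

# Chiplet approach: cherry-pick the best 4 dies for a high-frequency SKU.
best_four = sorted(chiplet_bins, reverse=True)[:4]
print("Top-bin SKU dies:", best_four, "-> limited by", min(best_four), "GHz")

# Monolithic approach: one big die containing 4 core blocks; the slowest
# block on that single die sets the bin, and you can't swap it out.
monolithic_blocks = chiplet_bins[:4]
print("Monolithic blocks:", monolithic_blocks, "-> limited by", min(monolithic_blocks), "GHz")
```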
mode_13h - Monday, April 22, 2019 - link
With a monolithic CPU, you can actively route data through the CPU die; you don't need an active interposer. I'm not saying chiplets are bad, just disputing that one point.
nevcairiel - Wednesday, April 17, 2019 - link
People always misunderstand this comment, since it was always a tad sarcastic. Intel actually did the "glue" thing first, back in the olden days, and back then AMD made glue jokes about it. Then AMD adopted it a decade or so later, and Intel threw the joke back at them.
But because people have short memories, or could barely read the first time, the history is lost, and the cycle just repeats itself from the other side.
FreckledTrout - Wednesday, April 17, 2019 - link
If you recall, AMD had the first dual-core chip, so Intel glued two CPUs together to make the Pentium Extreme Edition 840, which sold for $1000. In that case Intel really did take two existing CPUs and glue them together with zero thought. In my mind, AMD's approach with Zen was fairly elegant and first to market, but Intel's new EMIB approach is even better. Who knows, by the time there are desktop parts using EMIB we could see AMD's approach evolve even further.
evernessince - Sunday, April 21, 2019 - link
The difference being that people do not dig up a statement you made 10 years ago to call it wrong. It's not that people have short memories; it's that it would be stupid to lambast someone for something said so far back. If you hold onto statements for 10 years just to eventually laugh in someone's face, I'd imagine you'd have zero friends.
Intel is catching flak because, within a year of mocking glue, they announced they are now using it themselves. Time is the most important piece of context.
eva02langley - Wednesday, April 17, 2019 - link
Cascade Fail?
rocky12345 - Wednesday, April 17, 2019 - link
I'm pretty sure the only ones that looked stupid were Intel, for making such a stupid comment and following it up with nice little charts as well. So if you want to make someone look stupid, point that at Intel for making such a dumb statement back then. Of course people will keep rehashing that statement; it marks a point in time people will remember, when Intel was so butthurt and had very little to say that they lashed out with the 'glued together' remark. It will stick with them for a very long time; get used to it.
HStewart - Wednesday, April 17, 2019 - link
I would not think of Intel's packaging as glue; think of it as logically combining chip functions from different processes. As for gluing chips to the motherboard: technology is getting smaller and smaller, and people try to do things with technology it was not originally designed for and then complain about a faulty chip when it breaks. This matters even more for mobile devices, now that people want to take their phones into pools and such.
wumpus - Thursday, April 18, 2019 - link
Maybe they are using Intel Glew? No, Krazy Glew appears to be working for Nvidia these days. Never mind.
eva02langley - Wednesday, April 17, 2019 - link
Once again, Intel talks and talks, but never walks. They dropped out of 5G modems, their GPU attempts in the past were unsuccessful, their 10nm process is in shambles, their 14nm shortage is ridiculous, and their Cascade Lake is a joke at 350-400W.
So people believing that Intel's first try at chiplets, jumping straight to 3D stacking, will be successful are mistaken. The 10nm process proved that too many variables can be a real nightmare even for Intel, and 3D stacking might be even worse than that.
This is PR for stealing AMD's thunder; that's all I see.
TristanSDX - Wednesday, April 17, 2019 - link
What thunder? The upcoming Ryzen 3XXX? It will be barely 10% faster than the current generation, and when Sunny Cove gets out, the gap will be wider than it is now.
Dragonstongue - Wednesday, April 17, 2019 - link
I guess you shook your magic 8-ball and know this for a fact? Also, if it is 10% across the board AND with an even better reduction in power than early reports suggest, either way this points to AMD making ample ground in 2019, leading into 2020.
Just like in 2018 and 2017.
The only "modern" Intel parts, IMO, are the 9000 series, and very few of them were amazing, pricing aside. AMD really has them in lockstep: Intel has massive costs and walks a tight line that AMD does not have to worry about nearly as much. AMD is on a roll, actually walking what they talk, while Intel has been pretty much all talk; they should just release, see what folks say, and then do the BIOS, drivers, etc. As it stands, with Intel using the same to drastically more power than AMD, meh, I will take a 10% loss in performance for a good chunk less money and lower power usage any day.
Targon - Wednesday, April 17, 2019 - link
Third-generation Ryzen will see a 13-15 percent IPC boost, with the 7nm fab process allowing around a 20 percent boost to clock speeds. Put those together (they multiply, as sketched below), and that's a big jump in one generation.
Intel is counting on 10nm to help, but that only allows for the potential of higher clock speeds; IPC hasn't really improved. And claiming that Intel will magically get better because of its statements... wasn't 10nm on track back in 2015? Yeah, you can really trust what Intel says when it comes to release schedules or performance.
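A minimal back-of-the-envelope sketch of how those two rumored gains compound (the 13-15% IPC and ~20% clock figures are the estimates above, not confirmed specs):

```python
# Performance scales roughly with IPC x clock, so percentage gains
# compound multiplicatively rather than adding.
def combined_speedup(ipc_gain, clock_gain):
    """ipc_gain and clock_gain as fractions, e.g. 0.15 for 15%."""
    return (1 + ipc_gain) * (1 + clock_gain) - 1

# Using the rumored figures quoted above (not official numbers):
for ipc in (0.13, 0.15):
    print(f"IPC +{ipc:.0%}, clock +20% -> ~{combined_speedup(ipc, 0.20):.0%} overall")
# Roughly 36% to 38% generational uplift under these assumptions.
```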
Tkan215215 - Thursday, April 18, 2019 - link
True; with no IPC improvement, that cannot help Intel much further. And Intel's TDP is fantasy that does not come near real-world measurements.
Santoval - Wednesday, April 17, 2019 - link
Taking into account Zen 2's guesstimated IPC increase, the switch to the 7nm node, and the associated power efficiency gains and higher clocks expected from the new node, Zen 2 should be at least 25% faster than the original Zen and 20% faster than Zen+ (unless AMD screwed up the design and TSMC screwed up the high-performance variant of their 7nm node). "At least" as in "worst-case scenario". More realistically, though still rather conservatively, I expect a 25% gain over Zen+ and a 30% gain over the original Zen.
Targon - Wednesday, April 17, 2019 - link
If the high end can hit a 5.0GHz overclock on all cores, and first-generation Ryzen was hitting 4.0GHz on all cores, that's a 25 percent boost in clock speed alone. The 13-15 percent IPC figure was over Zen+, which itself brought a 3-5 percent IPC boost over first-generation Ryzen. So if that 5.0GHz figure is correct, clock speed alone would give the 25 percent boost, not even taking the IPC improvements into account.
evernessince - Sunday, April 21, 2019 - link
An engineering sample of Zen 2 did tie the 9900K. Given that the original Zen improved clocks by 23% from engineering sample to retail, I expect Zen 2 to do well.
Korguz - Sunday, April 21, 2019 - link
May I ask where you read/saw this? It might be interesting to read :-)
Opencg - Wednesday, April 17, 2019 - link
AMD stated that third-generation Ryzen will offer performance on par with the 9900K for gamers. This will be the first time in a long while that AMD has had a gaming CPU tied for the lead. I would take this claim seriously, based on their Radeon VII statements being pretty accurate. Power requirements and pricing will be interesting, though.
Targon - Wednesday, April 17, 2019 - link
That was also with a demo unit that only had a single CCX. If AMD can pull off a 5.0GHz clock speed on all cores, and if AMD sets the price for 8 cores in the $330-$350 range, Intel is in for a rough time for the rest of the year. Yes, there are several "if"s in there, because we can't know the exact specs until we get Computex leaks, and even then there will be some doubt about the final performance numbers until review sites get a chance to test the release versions, but I expect that AMD can hit 8 [email protected] at the very least.
mode_13h - Saturday, April 20, 2019 - link
As 7nm matures, AMD can always move desktop chips back to a single die. That's how I expect they'll answer Sunny Cove. In the meantime, they're getting a 7nm chip out sooner than it might otherwise be viable.
evernessince - Sunday, April 21, 2019 - link
Don't get your hopes up for Sunny Cove. Intel themselves said their first iteration of 10nm won't be better than 14nm++. It's going to be a minor bump in IPC and perhaps more cores; anything more simply isn't in the cards. Yields are also going to be pretty bad for the first year.
DigitalFreak - Wednesday, April 17, 2019 - link
This is a repeat of the Athlon days. AMD caught Intel with their proverbial pants down and was able to take advantage of it. Intel eventually got its act together and came back with the Core 2 products, knocking AMD back down. I expect the same thing will happen again, since Intel has many more resources and much more money than AMD. It may take longer this time, though, as AMD has a good product in Ryzen.
Targon - Wednesday, April 17, 2019 - link
Intel is still playing games with retailers to keep AMD products out of product displays, though. With the Intel CPU shortage, you can't honestly think that retailers are putting up all-Intel displays for products that aren't available in large numbers without something improper going on.
eva02langley - Wednesday, April 17, 2019 - link
This time around, AMD is taking all the growth opportunities that Intel is unable to block. This is the first time that AMD can finally pierce the mobile and server markets and Intel cannot do anything about it, because they are unable to supply the OEMs. Well, they have it coming; there is no turning back at this point.
In 5 years, AMD could end up with 30-35% of the datacenter market. If they can manage similar figures in laptops, that would be disturbing.
As for gaming, well, it is over: all games are going to be developed on AMD chips, except for the Switch.
mode_13h - Saturday, April 20, 2019 - link
The big spoiler for AMD's plans would be ARM. ARM is gunning for both the laptop and server markets in a pretty big way. Still, AMD is such a small company that even 10-15% of the datacenter market would be huge for them.
Tkan215215 - Thursday, April 18, 2019 - link
Intel won't; they don't use MCM, and AMD has cheaper and faster CPUs.
evernessince - Sunday, April 21, 2019 - link
Intel didn't get its act together; it illegally bribed OEMs.
Korguz - Sunday, April 21, 2019 - link
Some think they are still doing this in some way...
Opencg - Wednesday, April 17, 2019 - link
It's more like PR for investors. If they had actual products coming for sure, they would mention them. I think Intel has a strong future, but these things aren't gonna be exciting to me personally for a long time.
YB1064 - Wednesday, April 17, 2019 - link
This looks like a paid-for advertisement.
DannyH246 - Wednesday, April 17, 2019 - link
Agreed. IMO this has always been a heavily Intel-biased site... However, I enjoy a good laugh reading their marketing crap.
Irata - Wednesday, April 17, 2019 - link
Now, this could entirely be a coincidence, but the timing is a bit curious: first you have the "Intel exits 5G smartphone modem market" news (as a side note), followed shortly after by "look what great technology Intel has planned".
Ryan Smith - Thursday, April 18, 2019 - link
To be sure, this interview was a couple of weeks back. So this was scheduled well before Intel's modem woes.
Korguz - Wednesday, April 17, 2019 - link
DannyH246, there are far more Intel-biased sites out there than AnandTech...
Opencg - Wednesday, April 17, 2019 - link
It's just a report based on info provided by the company, and while it's parsed differently and has extra comments/analysis, it's hard to really write a critical piece on a company's far-future plans. Most of this stuff is in VERY early development.
Ryan Smith - Thursday, April 18, 2019 - link
"This looks like a paid for advertisement. "I'm sorry to hear you think that. But none the less that's good feedback for us to hear, as that's how we improve things.
So as a bit of background, Intel reached out to us to have a chat about future packaging/interconnect technologies. There was no specific presentation, just a chance to ask about the current state of affairs.
Generally speaking these are some of the cooler articles we get to work on, both because we get a chance to talk to a division of a company we don't usually get to talk to, and because we get to focus a bit on the future as opposed to benchmarking yet another chip right now.
But if you think it sounds like a paid advertisement, that's good feedback to have. Is there anything you'd like to have seen done differently?
Opencg - Thursday, April 18, 2019 - link
This is actually a great article with great info. I think they were expecting something impossible, based on the emotional responses they have to companies' failures and successes. If you catered to that crowd, the articles would seem bloated with pointless bias.
mode_13h - Saturday, April 20, 2019 - link
I actually like when you talk to people "down in the trenches", so kudos on that. I thought there was some useful detail in there; it definitely helped me understand and appreciate the differences between an interposer and EMIB. It's also interesting to know that they haven't been looking into cooling for Foveros, since that's clearly one of the challenges in using it to increase compute density. And it's not too surprising, but I didn't know that EMIB assembly and packaging basically needed to happen at Intel, even for outside chiplets.
I think one point you're dancing around is that multi-die GPUs depend heavily on software innovations to improve data locality and load balancing. This should benefit even monolithic GPUs as well.
Perhaps @YB1064 could cite which elements and aspects make it sound like an ad.
wumpus - Wednesday, April 17, 2019 - link
Some people appear to be missing the bigger questions this raises.
GPU chiplets? As noted in the article, [multiple] GPU chiplets suck even more than CPU chiplets. While you might want a single mask to provide all your GPUs, it is unlikely to be worth the cost of the latency. What this means is that Intel believes that even if they ever get 10nm to work, yield will suck. Or it could just be Raj's pet project that will keep Intel GPUs at the "fallback for when your graphics card dies" level.
Pretty sure this isn't Intel copying AMD (again). Much of this work has been done by Apple, Samsung and the big phone players, although they don't have the power/heat issues that Intel will wrestle with (ignoring heat with 3d stacking? Yeah, that's a smart move. Please ignore the elephant in the room, and pretend it is a spherical cow).
This all looks great from a server perspective, but I'm not sure it has any effect on the laptop/desktop world I'm familiar with. Perhaps eliminating the power/heat costs will help shore up their laptop line vs. the Ryzen 3000 onslaught, but all of this looks like they really don't care about that world (unless you really believe in multiple GPU chiplets).
While the "ARM invasion" of the server world has been more talk and the occasional pile of money burned to the ground, Intel knows that the ARM world has most of these issues solved (although less so for server-level heat generation). I'd suspect that this is making sure they are ready to compete with Ampere if they ever get a competitive CPU going.
- Note that my disdain for GPU chiplets wouldn't apply to Nvidia making a Tesla compute "chiplet" that is ~700 mm² (about as big as you can make a chip, and the size of current HPC Nvidia "GPUs"). Since they already use an interposer (for the HBM), it might not be much of a stretch to glue a few together (building a much larger chip would suffer the same issues and might not have any yield at all).
Haawser - Wednesday, April 17, 2019 - link
Yeah, gaming chiplet GPUs? Unlikely. But HPC 'GPU accelerators' using chiplets? I think that's ~100% likely, from both Intel and AMD.
eva02langley - Wednesday, April 17, 2019 - link
You are right on that; I just posted a quote from David Wang going over this.
wumpus - Thursday, April 18, 2019 - link
My disdain is only for multiple GPU chiplets (unless said "chiplet" is *huge*).
eva02langley - Wednesday, April 17, 2019 - link
Ignoring heat issues proves that Intel is just trying to attract clueless investors. These heat issues will factor heavily into how the silicon is put together. So many physical variables are in play here, and Intel just dismisses all of them. Guess what: stacking GPU chiplets is going to work really well for them if they are using liquid nitrogen... /sarcasm
mode_13h - Saturday, April 20, 2019 - link
You could stack GPU chiplets on memory, though. HBM stacking works because it's low-clocked and doesn't generate much heat.
Opencg - Wednesday, April 17, 2019 - link
It's hardly a bad idea, though. The GPU has always been the parallelism specialist, and continuing to grow transistor counts past the Moore's Law boundary is the most likely future for the GPU.
Rudde - Thursday, April 18, 2019 - link
Nvidia DGX?
mode_13h - Saturday, April 20, 2019 - link
Yes, it's a bit like a large, multi-die GPU. NVLink is cache coherent, so it can certainly be used in that way. It's up to software to make it scale efficiently.
mode_13h - Saturday, April 20, 2019 - link
I think you're off the mark on your GPU latency concerns. Power and bottlenecks will be the main issues with multi-die GPUs.
GPU architectures are already quite good at managing latency, which is why they can cope with higher-latency GDDR-type memories. But shuffling lots of data around between dies could be a killer. So what's really needed is software innovation to improve locality and load balancing between the chiplets. Without that, a multi-die GPU would just be a slow, hot mess.
BTW, Nvidia has already spun their NVLink-connected multi-GPU server platforms as "the world's biggest GPU". Treating them as a single GPU indeed could give them a lead on the software front needed to enable efficient scaling with chiplet-based GPUs.
eva02langley - Wednesday, April 17, 2019 - link
Here you go... all talk, no walk: "Ramune did state that Intel has not specifically looked into advanced cooling methods for Foveros type chips, but did expect work in this field over the coming years, either internally or externally."
mode_13h - Saturday, April 20, 2019 - link
IMO, they should just turn the whole package into a big vapor chamber.
eva02langley - Wednesday, April 17, 2019 - link
"When discussing products in the future, one critical comment did arise from our conversation. This might have been something we missed back at Intel’s Architecture Day in December last year, but it was reiterated that Intel will be bringing both EMIB and Foveros into its designs for future graphics technologies. "Intel is going to shoot themselves in the foot.
Wang on the other hand has something different to say in a post-Koduri led RTG, where he added: "To some extent you're talking about doing CrossFire on a single package. The challenge is that unless we make it invisible to the ISVs [independent software vendors] you're going to see the same sort of reluctance. We're going down that path on the CPU side, and I think on the GPU we're always looking at new ideas. But the GPU has unique constraints with this type of NUMA [non-uniform memory access] architecture, and how you combine features... The multithreaded CPU is a bit easier to scale the workload. The NUMA is part of the OS support so it's much easier to handle this multi-die thing relative to the graphics type of workload".
Read more: https://www.tweaktown.com/news/62244/amd-ready-mcm...
ats - Wednesday, April 17, 2019 - link
Lol, this is actually completely backwards. GPUs have significantly lower communication requirements than the typical server CPU. In fact, GPUs are specifically designed with effectively fully siloed blocks on chip; SMs, for instance, don't have a lot of communication going between them. You could rather easily develop a multi-chip GPU with a central cache/memory die connecting to multiple pipeline modules, since the SMs really only need to talk to the caches and memory. Latency overhead is pretty minimal as well; after all, the whole design of GPUs is built to tolerate the massive latency required for texturing.
mode_13h - Saturday, April 20, 2019 - link
I wouldn't be so dismissive of AMD's point. ISVs don't like multi-GPU, and a chiplet strategy looks a heck of a lot like that.
And you really can't put all your memory controllers in one chiplet; that would be a huge bottleneck and wouldn't scale at all. GDDR already uses narrow channels, so it naturally distributes across the various chiplets.
The way to do it is to distribute work and data together, so each chiplet is mostly accessing data from its local memory pool. Using Foveros, you might even stack the chiplet on this DRAM.
ats - Sunday, April 21, 2019 - link
Not really. A multi-die strategy only looks like a multi-GPU strategy if the individual chiplets are exposed.
And everyone already puts all their memory controllers in one die, FYI. You want them all in one die to reduce cross-talk between dies. The workload is such that distributing the memory results in massive communication overheads.
mode_13h - Monday, April 22, 2019 - link
Hiding the fact that cores are on different dies doesn't make bottlenecks go away; it just hamstrings the ability of software to work around them.
And when you say "everyone already puts all their memory controllers in one die", who exactly is "everyone"? So far, the only such example is Ryzen 3000, which is a CPU, not a GPU. GPUs require an order of magnitude more memory bandwidth. Furthermore, this amount must scale with the number of chiplets, which doesn't happen if you have one memory-controller die.
ats - Tuesday, April 23, 2019 - link
Who's hiding cores? No one cares where the shader blocks are as long as they are fed.
And who is everyone? Nvidia and Radeon. We're talking GPUs here, FYI. One memory-controller die is fine; it can rather easily connect to 512-1024 GB/s of bandwidth, because existing GPUs already do that.
mode_13h - Tuesday, April 23, 2019 - link
You said "A multi-die strategy only looks like a multi-gpu strategy if the individual chiplets are exposed", which means you aim to obscure the fact that GPU cores aren't all on the same die. That's what I meant by "Hiding the fact that cores are on different dies".Anyway, I guess you're right and the industry is wrong for going NUMA. They are all idiots and you are the true genius. You should not waste your precious time and talent on here. Instead, go to Nvidia and tell them that they are on the wrong track. Then, hit up AMD and inform them that Wang (and probably also Lisa Su) doesn't really know how to design GPUs and you will step in and do it for them.
ats - Tuesday, April 23, 2019 - link
Um, by your standard of hiding cores, every GPU on the market hides cores. Programmers just generate a workflow; they don't need to know how many shaders per SM or how many SMs per die.
NUMA is, and always has been, a last resort. Having designed multiple NUMA systems that have shipped for commercial revenue, from multiple different vendors and covering three different ISAs, I'm fairly confident in saying that. NUMA is never a goal; it is always something you try to minimize as much as possible. Instead, it is a compromise you make to work within the given constraints.
For GPUs built with chiplets, the constraints that would require a NUMA design largely don't exist. And a Rome-style design is actually the better path for the high-end, enthusiast, and enterprise/HPC markets. It allows you to push the technology edge for the actual compute components as fast as possible, keeps costs down by letting you use a trailing-edge process for the memory/interface component, and should, given large-die effects on frequency and power, allow a lower-power and overall more efficient design. This is largely enabled for GPUs by the rather limited (to non-existent) shader-to-shader communication requirements (which is, after all, the primary computational-model advantage of GPUs vs. CPUs).
Let me repeat that somewhat for emphasis: CPUs have both an expectation and practice of computation-to-computation communication, while GPUs do not and instead have an expectation and practice of communication generally only between computational phases (a complex way of saying that GPUs are optimized to read from memory, compute in silos, and write back to memory).
mode_13h - Wednesday, April 24, 2019 - link
> every GPU on the market hides cores.
Well, that's the problem Wang cites.
> they don't need to know how many shaders per SM or how many SMs per die.
Not if they're assuming an SMP memory model. Again, that's a problem, to the extent that they control where things execute and where memory is allocated. I know OpenGL doesn't expose those details, but I know far less about Vulkan and DX12.
> GPUs ... have an expectation and practice of communication generally only between computational phases
You're living in the past. Also, you underestimate how much communication there is between the stages of a modern graphics pipeline (which, contrary to how it may sound, execute concurrently for different portions of the scene). Current GPUs are converging on cache-based, coherent memory subsystems.
https://images.nvidia.com/content/volta-architectu...
(see "Enhanced L1 Data Cache and Shared Memory")
And from https://en.wikichip.org/w/images/a/a1/vega-whitepa...
"To extract maximum benefit from Vega’s new cache hierarchy, all of the graphics blocks have been made clients of the L2 cache."
ats - Wednesday, April 24, 2019 - link
Yeah, I've read those. They don't support your argument, nor do the actual programming models in use. The second you try to treat GPU memory as dynamically coherent, everything slows to a crawl.
mode_13h - Wednesday, April 24, 2019 - link
You missed the point, which is that, whether you like it or not, GPUs are adopting a cache-based memory hierarchy much like that of CPUs.
Of course, they *also* have specialized on-chip scratchpad memory (the Nvidia doc I referenced calls it "shared memory", while I think the operative AMD term is LDS). I don't expect that to disappear, but there's been a distinct shift towards becoming increasingly cache-dependent.
ats - Thursday, April 25, 2019 - link
GPUs have had a cache hierarchy from pretty much day one, FYI. And no, they are not adopting CPU-like coherency.
mode_13h - Saturday, April 27, 2019 - link
Without defining day one, that's a tautology.
You're also ignoring the fact that GPUs are increasing their dependence on the cache hierarchy, increasing cache sizes, and increasing coherence support. AMD's Vega has a coherent L2 and optionally coherent L1. From the Vega whitepaper (linked above):
"“Vega” 10 is the first AMD graphics processor built using the Infinity Fabric interconnect that also underpins our "Zen" microprocessors. This low-latency, SoC-style interconnect provides coherent communication between on-chip logic blocks"
Also, as I cited above, the L1 cache in Volta was promoted to be much more CPU-like:
"Integration within the shared memory block ensures the Volta GV100 L1 cache has much lower latency and higher bandwidth than the L1 caches in past NVIDIA GPUs. The L1 In Volta functions as a high-throughput conduit for streaming data while simultaneously providing high-bandwidth
and low-latency access to frequently reused data—the best of both worlds."
So, to disregard changes in GPUs' cache organization and utilization so dismissively seems baldly disingenuous. I take it as a sign that you've nothing better to offer.
mode_13h - Wednesday, April 24, 2019 - link
> CPUs have both an expectation and practice of computation-to-computation communication, while GPUs do not and instead have an expectation and practice of communication generally only between computational phases (a complex way of saying that GPUs are optimized to read from memory, compute in silos, and write back to memory).
The funny thing about this is that it actually argues *for* NUMA in GPUs and *against* NUMA in CPUs. So, thanks for that.
But it's also an overstatement, in the sense that CPU code which does too much global communication will scale poorly, and GPUs (and graphics shader languages) have long supported a full contingent of atomics, as graphics code stays a bit less in silos than you seem to think.
ats - Wednesday, April 24, 2019 - link
Um, then you don't understand it. Every shader block basically needs access to every memory block to maintain performance; doing that with NUMA introduces massive bottlenecks.
And if you start using a lot of atomics in GPU code, it comes crashing to a halt pretty quickly.
mode_13h - Wednesday, April 24, 2019 - link
You clearly know nothing about graphics programming. You can't design an architecture to solve a problem you don't understand.
Those who introduced atomics in shaders clearly knew what they were doing and the corresponding performance impacts, but sometimes there's just no way around it. It's also laughable to suggest that there's no locality of reference in graphics. In particular, to say that right after claiming that GPUs "compute in silos" is quite rich.
Keep 'em coming. I'll go make some popcorn!
mode_13h - Monday, April 22, 2019 - link
The funny thing is that the only example we really have of a multi-die GPU is Nvidia's ISCA 2017 paper, cited in the article: https://images.anandtech.com/doci/14211/MGPU.png
Now, you can try to argue that Nvidia doesn't know what they're doing... but I wouldn't.
Scaling GPUs from 1-4 chiplets is good enough... at least for a first generation.
Rudde - Thursday, April 18, 2019 - link
You really have to make the memory shared between all chiplets. That is, less like Nvidia's theoretical MCM design (the picture in this article) and more like EPYC Rome. Rome has a central I/O hub that accesses memory; Nvidia outlined a NUMA architecture.
AMD's approach is probably too slow for gaming purposes, but it could turn out to be a nice compute card.
ats - Thursday, April 18, 2019 - link
It shouldn't be too slow. You are adding a handful of nanoseconds of latency to what is at best a 100+ ns latency path, and often upwards of thousands of ns. In fact, the asymmetrical design has a long history in graphics hardware dating back decades; you only start to run into issues with symmetrical designs, where every chip has to communicate with every other chip due to distributed memory.
This whole "too slow for gaming purposes" idea is just laughable. You are realistically dealing with loaded memory latencies well over 100 ns in a graphics chip running a gaming workload, and the die crossing is going to add single-digit nanosecond latencies to the path overall (a rough comparison is sketched below). It's not going to have any effect on something measured in milliseconds, especially since the whole graphics shader pipeline is, by necessity, designed from inception to be latency-insensitive.
And honestly, for both GPUs and server CPUs, Rome's design makes much more architectural sense than the previous EPYC design, which was primarily constrained to allow 100% reuse of the consumer part in the server market.
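As a rough sanity check on the latency point (all numbers here are illustrative assumptions, not measured figures): adding a few nanoseconds of die-crossing latency to a memory path already in the hundreds of nanoseconds is a small relative change, and it disappears entirely next to a ~16.7 ms frame budget.

```python
# Back-of-the-envelope latency comparison (illustrative values only).
loaded_mem_latency_ns = 150.0   # assumed loaded DRAM latency inside a GPU
die_crossing_ns = 5.0           # assumed extra hop through an EMIB-style link
frame_budget_ns = 1e9 / 60      # one frame at 60 fps, in nanoseconds

print(f"Added memory latency: {die_crossing_ns / loaded_mem_latency_ns:.1%}")   # ~3.3%
print(f"Die crossing vs frame budget: {die_crossing_ns / frame_budget_ns:.2e}") # ~3e-07
```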
mode_13h - Saturday, April 20, 2019 - link
The issue with shuffling data between dies is not latency - it's power and bottlenecks.
mode_13h - Saturday, April 20, 2019 - link
I should qualify that as concerning GPUs, BTW.
ats - Sunday, April 21, 2019 - link
Bottlenecks aren't much of an issue given an EMIB interconnect. Power will increase slightly, but that could largely be made up for by not having to deal with large-die effects.
mode_13h - Monday, April 22, 2019 - link
Focusing on EMIB misses the point. The point is that you have a situation where all GPU cores from all dies are equidistant from all memory controllers; you're forcing the memory controller die to have a giant, multi-TB/sec crossbar. If you read any coverage of NVSwitch, you'd know it already burns a significant amount of power, and what you're talking about is even a multiple of that level of bandwidth:
https://www.anandtech.com/show/12581/nvidia-develo...
According to https://www.nextplatform.com/2018/04/13/building-b...
"NVSwitch consumes less than 100 watts per chip."
Granted, a good amount of that is to drive the NVLink signals further distances than required for in-package signalling, but a non-trivial amount must be for the switching, itself.
To scale *efficiently*, the way to go is NUMA hardware + NUMA-aware software. This is the trend we've been seeing in the industry over the past two decades. And it applies *inside* an MCM as well as outside the package, particularly when you're talking about GPU-scale bandwidth numbers. The only reason 7nm EPYC can get away with it is that its total bandwidth requirements are far lower.
ats - Tuesday, April 23, 2019 - link
GPUs already have giant multi-TB/sec crossbars. How do you think they already connect the shader blocks to the memory controller blocks?
NVSwitch is not even close to equivalent to what we are describing here. We're doing an on-package, optimized interconnect, and simply moving what would have been a massive crossbar in a monolithic GPU into a separate die connecting to a bunch of smaller dies that are effectively shader blocks, just like an Nvidia SM. You are grossly mixing technologies here. NVLink is a meter-plus external FR4 interconnect; it has basically nothing in common with an on-package interconnect.
NVSwitch's power should be almost all link-based. I've actually designed switches/routers that have shipped in silicon in millions of devices; they don't actually take that much power.
And no, NUMA doesn't make much sense for a GPU system. It significantly increases the overheads for minimal to no benefit. NUMA isn't a goal, it is a problem. Always has been, always will be. If you don't have to go down that path, and you don't for GPUs, then you don't do it. We do it for CPUs because we need to scale coherently across multiple packages, which is not and has never been a GPU issue.
mode_13h - Wednesday, April 24, 2019 - link
> GPUs already have giant multi-TB/sec crossbars. How do you think they already connect the shader blocks to the memory controller blocks?
There are obviously different ways to connect memory controllers to compute resources, such as ring and mesh topologies. In AnandTech's Tonga review, they call out Tahiti's use of a crossbar as somewhat exceptional and costly:
"it’s notable since at an architectural level Tahiti had to use a memory crossbar between the ROPs and memory bus due to their mismatched size (each block of 4 ROPs wants to be paired with a 32bit memory channel). The crossbar on Tahiti exposes the cards to more memory bandwidth, but it also introduces some inefficiencies of its own that make the subject a tradeoff."
...
"The narrower memory bus means that AMD was able to drop a pair of memory controllers and the memory crossbar"
https://www.anandtech.com/show/8460/amd-radeon-r9-...
Also, reduction in the width of its intra-SMX crossbars is cited in their Maxwell 2 review as one of the keys to its doubling of perf/W over Kepler:
https://www.anandtech.com/show/8526/nvidia-geforce...
> NUMA isn't a goal, it is a problem. ... If you don't have to go down that path, ... then you don't do it.
No debate there. Nobody wants it for its own sake; people do it because it scales. Crossbars do not. To feed a 4-chiplet GPU, you're probably going to need 4 stacks of HBM2, which would seem to mean 32 channels (assuming 128 bits per channel). So we're probably talking about something like a 32x32-port crossbar. Or take GDDR6, which uses 16-bit channels, meaning a memory data bus 512 bits wide would also have 32 channels (the channel math is tallied in the sketch below). Nvidia's TU102 uses a 384-bit bus, so I figure a proper multi-die GPU architecture should be capable of surpassing that.
One issue is that you're assuming GPUs don't already have some form of NUMA. Even CPUs, like Intel's previous-generation HCC Xeon dies and their current-gen server dies, have memory controllers distributed throughout their topology in a way that limits connectivity and increases latency non-uniformly. Why wouldn't they have just used a crossbar, if crossbars are as cheap as you claim?
But the real win NUMA can offer is to keep non-shared memory accesses local. So, you don't burn power shuffling around more data than necessary, you don't risk excessive bank conflicts, and (bonus) you don't need enormous, hot switches with enough bandwidth to handle all of your memory traffic.
The only reason not to consider NUMA for what's essentially a supercomputer in a package is what Wang cites, which is that ISVs have gotten used to treating graphics cores as SMP nodes. So, you've got to either bring them along, or devise a NUMA architecture that behaves enough like SMP that their existing codebase doesn't take a performance hit.
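A quick tally of the channel counts referenced above (the per-channel widths are standard HBM2/GDDR6 figures; the per-pin data rates are assumptions for illustration):

```python
# Rough channel/bandwidth math for the hypothetical 4-chiplet GPU above.
hbm2_stacks = 4
hbm2_channels_per_stack = 8        # 8 x 128-bit channels per HBM2 stack
hbm2_channel_width_bits = 128
hbm2_gbps_per_pin = 2.0            # assumed 2 Gb/s per pin

gddr6_bus_width_bits = 512
gddr6_channel_width_bits = 16
gddr6_gbps_per_pin = 14.0          # assumed 14 Gb/s per pin

hbm2_channels = hbm2_stacks * hbm2_channels_per_stack               # 32
hbm2_tbps = hbm2_channels * hbm2_channel_width_bits * hbm2_gbps_per_pin / 8 / 1000
gddr6_channels = gddr6_bus_width_bits // gddr6_channel_width_bits   # 32
gddr6_tbps = gddr6_bus_width_bits * gddr6_gbps_per_pin / 8 / 1000

print(f"HBM2:  {hbm2_channels} channels, ~{hbm2_tbps:.2f} TB/s aggregate")
print(f"GDDR6: {gddr6_channels} channels, ~{gddr6_tbps:.2f} TB/s aggregate")
# A single centralized memory-controller die would have to switch that
# entire aggregate, which is the crossbar-scaling concern raised above.
```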
ats - Wednesday, April 24, 2019 - link
WRT why CPUs don't use a crossbar: because their traffic patterns are fundamentally different from GPUs'. GPUs don't have a lot of (if any) cross-communication between computational resources. Shader block 0 doesn't really care what shader block 1 is doing, what memory it is updating, etc. Communication is minimal, to the point that GPUs generally have a separate, special mechanism for the rare cases where it is required, so that the requirement doesn't impact the primary path. In contrast, a CPU's computational resources are in constant communication with each other; they are always maintaining global coherence throughout the entire memory and cache stack.
GPUs basically don't have non-shared memory access per se. That's what presents all the complexities with multi-GPU setups. Shader block 0 is just as likely to need to access memory block A or block B as shader block 1 is. For CPUs, there are plenty of workloads that, as long as they maintain coherence, don't have a lot of overlap or competition for given memory regions (and designs like AMD's original EPYC/Ryzen layout do best on these effectively shared-nothing workloads).
GPUs fundamentally need UMA-like access for graphics workloads.
mode_13h - Wednesday, April 24, 2019 - link
> GPUs don't have a lot of (if any) cross-communication between computational resources.
That might be true of GPUs from a decade ago, but AMD's GCN has a globally coherent L2 and optionally coherent L1 data cache.
http://developer.amd.com/wordpress/media/2013/06/2...
> Shader block 0 doesn't really care what shader block 1 is doing, what memory it is updating, etc.
That's precisely the case for NUMA in GPUs.
> GPUs basically don't have non-shared memory access per se. That's what presents all the complexities with multi-GPU setups.
Actually, the complexities with multi-GPU setups are largely due to the lack of cache coherency between GPUs and the orders of magnitude longer latencies and lower communication bandwidth than what we're talking about. It's no accident that NVLink and AMD's Infinity Fabric both introduce cache coherency.
> GPUs fundamentally need UMA-like access for graphics workloads.
The term of art for which you're reaching would be SMP. You want all compute elements to have a symmetric view of memory. Unified Memory Architecture typically refers to sharing of a common memory pool between the CPU and GPU.
https://en.wikipedia.org/w/index.php?title=Unified...
ats - Thursday, April 25, 2019 - link
Go ahead and write a program that relies on inter-shader coherence and see how that works for you...
No, it isn't the case for NUMA, because they all want access to all of memory; they just want read access. Make it NUMA and you'll be link-limited in no time.
SMP is symmetric multiprocessing. It is a term of art that means the computational units are the same; it does not actually describe the memory access granularity and complexity of the system. NUMA/UMA are the correct terms for referring to memory access granularity and locality.
mode_13h - Saturday, April 27, 2019 - link
> Go ahead and write a program that relies on inter-shader coherence and see how that works for you...
The same is true for multi-core CPUs. At least GPUs' deeper SMT implementation can keep shader resources from sitting idle, as long as there's enough other work to be done.
If current GPU shader programming languages had something akin to OpenCL's explicit memory hierarchy and work group structure, it would be easier for the runtime to allocate shared resources and schedule their usage closer to each other. That would be a big enabler for GPUs to go NUMA. That said, you can't eliminate global communication - you just need to reduce it to within what the topology can handle.
> No, it isn't the case for NUMA, because they all want access to all of memory; they just want read access.
I don't know why you think CPUs are so different. Most software doesn't use core affinity, and OS schedulers have long been poor at keeping threads and memory resources together. Efficient scaling only happens when software and hardware adapt in a collaborative fashion. OpenCL was clearly a recognition of this fact.
Also, GPUs already sort of have the problem you describe, with ever narrower memory channels. The 16-bit channels of GDDR6 could easily get over-saturated with requests. The GPU shader cores just see this as more latency, which they're already well-equipped to hide.
> SMP is symmetric multiprocessing. It is a term of art that means the computational units are the same; it does not actually describe the memory access granularity and complexity of the system.
Centralized memory is fundamental to the definition of SMP.
"Symmetric multiprocessing (SMP) involves a multiprocessor computer hardware and software architecture where two or more identical processors are connected to a single, shared main memory, have full access to all input and output devices, ..."
Smell This - Friday, April 19, 2019 - link
It remains to be seen whether AMD can pull this off.
They have danced around unified memory and cache coherency. Bringing it all together in a heterogeneous arch is the holy grail. The good news is that AMD has been building toward this moment for almost ten years, from the Fusion APUs with SIMD/AVX 'graphics engines' to chiplet designs with independent I/O. This. Is. Hard. Stuff. We've come quite a ways from the Fusion 'Onion and Garlic' days.
Bank collisions, cache thrashing, etc., are the bad old days of NUMA and page faults. Hopefully, Rome (and an improved IOMMU) will incrementally move the ball forward toward heterogeneous unified memory between cores/chiplets/bank groups, etc. Otherwise, we're stuck in a brute-force mindset.
Smell This - Friday, April 19, 2019 - link
Bring back bank 'swizzling'?? Is that still a thing? :)
mode_13h - Saturday, April 20, 2019 - link
The Rome approach won't scale. GPUs have far higher memory bandwidth requirements than CPUs. Nvidia's not stupid; they went with an EPYC (gen 1)-like architecture for good reasons.
outsideloop - Wednesday, April 17, 2019 - link
Let me summarize. Intel: "Look at what we could possibly make in the future, so that you don't consider adopting 7nm Rome/Ryzen/Navi in two months... Never mind that we will not have anything in hand until late 2020... Just look at the monkey!"
Tkan215215 - Thursday, April 18, 2019 - link
Lol, it's Intel again. Intel failed at 5G because they could not meet deadlines; what's the difference between 10nm and 5G? We will find out this year whether Intel can really be trusted.
mode_13h - Saturday, April 20, 2019 - link
...can't help wondering whether Ramune was named after the Japanese beverage sold in bottles that are sealed with a glass ball: https://en.wikipedia.org/wiki/Ramune
...which itself seems to be named after lemonade.
Charidemus - Sunday, April 21, 2019 - link
So now Intel is going to glue chips together? Funny thing, using the bullshit you threw at AMD as your new product strategy...
- Intel: We are going to use chiplets...
- AMD: Hold my beer.
name99 - Monday, April 22, 2019 - link
People are too interested in jumping straight into AMD vs. Intel to look at the big picture here: WHY is Intel pushing this tech?
Why would you design using chiplets rather than a single SoC? Under "normal" conditions, chiplets are going to be more expensive (packaging is never free) and use more power (going off-chip, even via EMIB or an interposer, is always more expensive than staying on-chip).
Some reasons that spring to mind:
- Integrating very different technologies. This is, e.g., the HBM case: DRAM fabs are different from logic fabs.
- Integrating IP from different companies. This is the Kaby Lake-G case.
- You can't make a die the size you want. This is the AMD case. It's not THAT interesting technically, in that it's just a fancy PCB (much more expensive, much more performant), and the target market is much smaller. If the economics of yield rates vs. interposers change, AMD shifts how they weigh multi-module vs. a larger die.
- OR, the last reason, a variant of the first: to mix and match the same technology but at different nodes.
NOW let's ask the Intel question again. Why are they doing this? Well, the only product they've told us about that uses Foveros is Lakefield. Lakefield has a ~140 mm² footprint; for reference, that's the same size as an A12X shipping on TSMC 7nm.
Why use Foveros for this thing?
It's NOT that large.
It's NOT using fancy memory integration. (There is packaged DRAM, but just basic PoP like phones have been doing for years.)
It's not integrating foreign IP.
What it IS doing is creating some of the product on 10nm, and some on 22nm.
So why do that? Intel will tell you that it saves costs, or some other BS. I don't buy that at all.
Here's what I think is going on. It's basically another admission that 10nm has, uh, problematic yields. The lousy yields mean Intel is building the bare minimum they have to on 10nm, which means a few small cores. What CAN be moved off that chip (I/O, PHYs, stuff like that) HAS been moved off, to shrink the 10nm die.
So I see this all as just one more part of Intel's ever-more-desperate EGOB, a desperate juggling of balls. Foveros is introduced as a great step forward, something that can do things current designs cannot. That's not actually true, not for a long while, but Intel has to keep extending the hype horizon to stay looking viable and leading-edge.
Meanwhile they ALSO have to ship something, anything, in 10nm, to keep that ball in the air.
So marry the two! Use a basically inappropriate match (a design that would not make sense for Apple or Qualcomm, because it's more expensive and higher power than just making a properly designed SoC of the appropriate size) to allow shipping Lakefield in acceptable volumes while keeping the 10nm die size minimal! And as a bonus, you can claim that you're doing something awesome with new tech!
OK, the Intel fans will all disagree and say this is crazy. Sure, but ask yourselves:
- Do you REALLY believe that it's cheaper to make two separate dies and package them together than to make one larger die? That goes against the economics of semiconductor logic that have held since the '70s. It ONLY works if the cost of 10nm is astonishingly high compared to 22nm (which is another way of saying terrible yield); a rough yield model is sketched after this list.
- If this is not being done because of 10nm yield, then why is it being done? To save 100 mm² of area? Give me a break! One square centimeter is essential on a watch, nice (but inessential) on a phone, and irrelevant in a laptop. If you're that starved for area in a laptop, even a really compact one, there are simpler options, like the sandwiched PCB used by the iPhone X and now (I think) by the newest Samsung phones.
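To make the yield argument concrete, here is a minimal sketch using a simple Poisson defect-yield model; the die areas and defect densities are illustrative assumptions, not Intel figures:

```python
import math

def die_yield(area_cm2, defects_per_cm2):
    """Simple Poisson yield model: Y = exp(-A * D0)."""
    return math.exp(-area_cm2 * defects_per_cm2)

# Hypothetical Lakefield-like split: a small 10nm compute die plus a 22nm
# base die, versus one monolithic die of the combined area on 10nm.
compute_area_cm2, base_area_cm2 = 0.8, 0.6   # assumed die areas
d0_22nm = 0.1                                # assumed mature-node defect density

for d0_10nm in (1.0, 0.2):                   # poor vs. matured 10nm yield, assumed
    split = die_yield(compute_area_cm2, d0_10nm) * die_yield(base_area_cm2, d0_22nm)
    mono = die_yield(compute_area_cm2 + base_area_cm2, d0_10nm)
    print(f"D0(10nm)={d0_10nm}: split dies ~{split:.0%}, monolithic 10nm ~{mono:.0%}")
# With poor 10nm yield the split wins big (~42% vs ~25%); as 10nm matures,
# the gap shrinks and packaging overhead starts to dominate the comparison.
```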
mode_13h - Tuesday, April 23, 2019 - link
Fair point. I hadn't realized Lakefield was so small.
Of course, I'd argue that Kaby G didn't make a heck of a lot of sense either. Maybe it's just Intel trying to sell Foveros and drive their customized processor business.
Haawser - Thursday, April 25, 2019 - link
Why? Simple: because it's still far cheaper than trying to make huge monolithic dies on ever-smaller process nodes.
You need to understand that certain defects, like line-edge roughness, don't scale. The smaller the lines you want to create, the bigger (relatively) such defects get. So as nodes get smaller, they get increasingly hard to make defect-free, irrespective of the size of the die.