Wow, yep, although Ryan leaves the door open in the article, it is clear HBM1 is limited to 1GB per stack with 4 stacks on the sample PCBs. How AMD negotiates this will be interesting.
Honestly it is looking more and more like Fiji is indeed Tonga XT x2 with HBM. Remember all those rumors last year when Tonga launched that it would be the launch vehicle for HBM? I guess it does support HBM, it just wasn't ready yet. Would also make sense as we have yet to see a fully-enabled Tonga ASIC; even though the Apple M295X has the full complement of 2048 SP, it doesn't have all memory controllers.
The 1024 bit wide bus of an HBM stack is composed of eight 128 bit wide channels. Perhaps only half of the channels need to be populated allowing for twice the number of stacks to reach 8 GB without changing the Fiji chip itself?
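Rough math on that idea, purely hypothetical since AMD hasn't said anything about half-populated stacks:

```python
# Sketch of the "half-populated channels" idea above. Assumes HBM1 stacks of
# 1 GB with eight 128-bit channels and a 4096-bit total interface on Fiji.
# None of this is confirmed by AMD; it's just the arithmetic.
BITS_PER_CHANNEL = 128
GB_PER_STACK = 1
FIJI_BUS_BITS = 4096

# Baseline: 4 stacks with all 8 channels per stack populated
stacks, channels = 4, 8
assert stacks * channels * BITS_PER_CHANNEL == FIJI_BUS_BITS
print(stacks * GB_PER_STACK, "GB")   # 4 GB

# Hypothetical: 8 stacks with only 4 channels per stack wired up
stacks, channels = 8, 4
assert stacks * channels * BITS_PER_CHANNEL == FIJI_BUS_BITS
print(stacks * GB_PER_STACK, "GB")   # 8 GB, same 4096-bit total width
```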
Electrical path latency is cut to essentially zero. EP latency is how many clocks are spent moving the data over the length of the electrical path, and today that latency is about one clock.
The M295X isn't only in Apple... I think Alienware has one too! XD
Yeah. Interesting, even Charlie points that out a lot. He also claims that developers laugh at needing over 4GB, which may be true in some games... GTA V and quite a few (very poor) games show otherwise.
Of course, how much you need to hold ~60+ FPS, I don't know. I believe that at 1440/1600p, GTA V at max doesn't get over 4GB, and I don't know how lowering settings changes the VRAM usage there. So hitting 60FPS at a higher res might require turning down the settings of CURRENT games (not future games, those are the problem!), which would probably fit *MOST* of them inside 4GB. I still highly doubt that GTA V and some others would fit, however. *grumble grumble*
Hope AMD is pulling the wool over everyone's eyes; however, their presentation does indeed seem to limit it to 4GB.
GTA V taking over 4 GB if available and GTA V needing over 4 GB are two very different things. If it needed that memory then 980 SLI and 290X CF/the 295X would choke and die. They don't.
The 3.5GB 970 chokes early on 4K and needs feature reduction, the 980 allows more features, the Titan yet more features, in large part due to memory config.
Yeah, it will be interesting to see how compression or new AA approaches lower memory usage, but I will not buy a 4GB high-end card now or in the future and depend on even more driver trickery to lower memory usage for demanding titles.
To substantiate my comment about driver trickery, this is a quote from TechReport's HBM article:
"When I asked Macri about this issue, he expressed confidence in AMD's ability to work around this capacity constraint. In fact, he said that current GPUs aren't terribly efficient with their memory capacity simply because GDDR5's architecture required ever-larger memory capacities in order to extract more bandwidth. As a result, AMD "never bothered to put a single engineer on using frame buffer memory better," because memory capacities kept growing. Essentially, that capacity was free, while engineers were not. Macri classified the utilization of memory capacity in current Radeon operation as "exceedingly poor" and said the "amount of data that gets touched sitting in there is embarrassing."
Strong words, indeed.
With HBM, he said, "we threw a couple of engineers at that problem," which will be addressed solely via the operating system and Radeon driver software. "We're not asking anybody to change their games.""
------------------------------
I don't trust them to deliver that on time and consistently.
lol yeah, hopefully they didn't just put the same couple of engineers who threw together the original FreeSync demos on that laptop, or the ones who are tasked with fixing the FreeSync ghosting/overdrive issues, or the FreeSync CrossFire issues, or the Project Cars/TW3 driver updates. You get the point hehe, those couple engineers are probably pretty busy, I am sure they are thrilled to have one more promise added to their plates. :)
Making something that is inefficient more efficient isn't "trickery," it's good engineering. And when the product comes out, we will be able to test it, so your trust is not required.
Wouldn't you see higher memory configs, much like the 970 memory config 'fiasco', with the greater-than-4GB portion on another substrate or in an entirely different configuration?
No. The current HBM stacks come in a fixed capacity, and the Fiji chip will only have so many lanes. Also, it is unlikely an OEM would venture into designing (and funding) their own interposer; this probably won't happen for at least a few years (if ever).
Actually, an OEM cannot design an interposer with a memory controller. AMD owns that patent.
Interposer having embedded memory controller circuitry US 20140089609 A1 " For high-performance computing systems, it is desirable for the processor and memory modules to be located within close proximity for faster communication (high bandwidth). Packaging chips in closer proximity not only improves performance, but can also reduce the energy expended when communicating between the processor and memory. It would be desirable to utilize the large amount of "empty" silicon that is available in an interposer. "
AMD has pretty much sewn up the concept of an interposer being just a substrate with vias to stack and connect silicon.
Besides, it would also be unlikely for an OEM to be able to purchase unpackaged CPU or memory silicon for their own stacks. And why would they? Their manufacturing costs would be far higher.
Don't forget the HBM1 vs. HBM2 change/upgrade that is coming. Will HBM2 show up late this year? Or early next year? Your guess. AMD will then be able to ship cards with twice the bandwidth--and four times the memory. My guess is that AMD plans a "mid-life kicker" for Fiji later this year taking it to 8 GBytes but still at HBM1 clock speeds. Then Greenland comes along with 16 Gig and HBM2 speeds.
BTW don't knock the color compression technology. It makes (slightly) more work for the GPU, but reduces memory and bandwidth requirements. When working at 4K resolutions and beyond, it becomes very significant.
GTA5 does go over 4GB at 1440p, as do a number of other next-gen games like Assassin's Creed Unity, Shadows of Mordor, Ryse, I am sure Witcher 3 does as well. 6GB is probably safe for this gen until 14/16nm FinFET, 8GB safest, 12GB if you want no doubts. We also don't know what DX12 is going to do to VRAM requirements.
It's not about fitting the actual frame buffer, it's about holding and storing textures locally in VRAM so that the GPU has access to them without going to system RAM or, worse, local storage. Hi-res 4K and 8K textures are becoming more common, which increases the storage footprint 4-fold and 16-fold over 2K, so more VRAM is always going to be welcome.
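For a sense of scale on that texture growth, here's the back-of-envelope math (assuming uncompressed 32-bit RGBA just to keep it simple; real games use block compression, but the 4x/16x ratios are the same):

```python
# Uncompressed 32-bit RGBA texture sizes (illustrative only; actual games use
# compressed formats, but the scaling ratio between resolutions holds).
for name, dim in [("2K", 2048), ("4K", 4096), ("8K", 8192)]:
    mb = dim * dim * 4 / (1024 ** 2)
    print(f"{name}: {mb:.0f} MB per texture")
# 2K: 16 MB, 4K: 64 MB (4x), 8K: 256 MB (16x)
```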
According to Nvidia, without GameWorks, the 980 is the recommended card for 1440p at max settings, and with GameWorks, a Titan X or SLI 970s.
At 2160p without GameWorks they recommend a Titan X or SLI 980s. Even at 2160p with GameWorks they still recommend 980 SLI.
Based on that, my WAG is that TW3 uses under 4GB of VRAM at 2160p. I'm guessing bringing GameWorks in pushes it just near the 4GB limit on the 980, probably into the 3.9GB range.
Can't say for sure as I don't have TW3 yet, but based on screenshots I wouldn't be surprised at all to see it break 4GB. In any case, games and drivers will obviously do what they can to work around any VRAM limitations, but as we have seen, it is not an ideal situation. I had a 980 and 290X long enough to know there were plenty of games dancing close enough to that 4GB ceiling at 1440p to make it too close for comfort.
TW3 doesn't even get close; the highest VRAM usage I've seen is ~2.3GB @1440p, everything ultra, AA on, etc. In fact, of all the games you mentioned, Shadows of Mordor is the only one that really pushes past 4GB @1440p in my experience (without unplayable levels of MSAA). Whether that makes much difference to playability is another thing entirely; I've played Shadows on a 4GB card @1440p and it wasn't a stuttery mess or anything. It's hard to know without framerate/frametime testing whether a specific game is using VRAM because it can or because it really requires it.
We've been through a period of rapid VRAM requirement expansion but I think things are going to plateau soon like they did with the ports from previous console generation.
I just got TW3 free with Nvidia's Titan X promotion and it doesn't seem to be pushing upward of 3GB, but the rest of the games absolutely do. Are you enabling AA? GTA5, Mordor, Ryse (with AA/SSAA), and Unity all do push over 4GB at 1440p. Also, any game that has heavy texture modding, like Skyrim, appreciates the extra VRAM.
Honestly I don't think we have hit the ceiling yet, the consoles are the best indication of this as they have 8GB of RAM, which is generally allocated as 2GB/6GB for CPU/GPU, so you are looking at ~6GB to really be safe, and we still haven't seen what DX12 will offer. Given many games are going to single large resources like megatextures, being able to load the entire texture to local VRAM would obviously be better than having to stream it in using advanced methods like bindless textures.
False, streaming from system RAM or slower resources is non-optimal compared to keeping it in local VRAM cache. Simply put, if you can eliminate streaming, you're going to get a better experience and more timely data accesses, plain and simple.
A quick check on HardOCP shows max settings on a Titan X with just under 4GB of VRAM used at 1440p. To keep it playable, you had to turn down the settings slightly. http://www.hardocp.com/article/2015/05/04/grand_th...
You certainly can push VRAM usage over 4GB at 1440/1600p, but, generally speaking, it appears that it would push the game into not being fluid.
Having at least 6GB is 100% the safe spot. 4GB is pushing it.
Those aren't max settings, not even close. FXAA is being used, turn it up to just 2xMSAA or MFAA with Nvidia and that breaks 4GB easily.
Source: I own a Titan X and play GTA5 at 1440p.
Also, the longer you play, the more you load, the bigger your RAM and VRAM footprint. And this is a game that launched on last-gen consoles in 2013, so to think 4GB is going to hold up for the life of this card with DX12 on the horizon is not a safe bet, imo.
Do not forget the color compression that AMD designed into their chips; it's in Fiji. In addition, AMD assigned some engineers to work on ways to use the 4GB of memory more efficiently. In the past AMD viewed memory as essentially free since capacities kept expanding, so they had never bothered to assign anyone to make memory usage efficient. Now, with a team having worked on that issue, which will be handled purely through driver changes that make memory usage and allocation more efficient, 4GB will be enough.
The Tonga XT x2 with HBM rumor is insane if you're suggesting the one I think you are. First off the chip has a GDDR memory controller, and second if the CF profile doesn't work out a 290X is a better card.
I do think it's crazy, but the more I read, the more credibility there is to that rumor lol. Btw, memory controllers can often support more than one standard; not uncommon at all. In fact, most of AMD's APUs can support HBM per their own whitepapers, and I do believe there was a similar leak last year that was the basis of the rumors that Tonga would launch with HBM.
Wouldn't be the first time David Kanter was wrong, certainly won't be the last. Still waiting for him to recant his nonsense article about PhysX lacking SSE and only supporting x87. But I guess that's why he's David Kanter and not David ReKanter.
You're just making up stuff. No way Fiji is just two Tonga chips stuck together. My guess is your identity is wrapped up in nVidia so you need to spread fud.
That will be motivation enough to really improve on the chip for the next generation(s), not just rebrand it. Because to be honest very, very few people need 6 or 8GB on a consumer card today. It's so prohibitively expensive that you'd just have an experiment like the $3000 (now just $1600) 12GB Titan Z.
The fact that a select few can or would buy such a graphics card doesn't justify the costs that go into building such a chip, costs that would trickle down into the mainstream. No point in asking 99% of potential buyers to pay more to cover the development of features they'd never use. Like a wider bus, a denser interposer, or whatever else is involved in doubling the possible amount of memory.
Idk, I do think 6 and 8GB will be the sweet spot for any "high-end" card. 4GB will certainly be good for 1080p, but if you want to run 1440p or higher and have the GPU grunt to push it, that will feel restrictive, imo.
As for the expense, I agree its a little bit crazy how much RAM they are packing on these parts. 4GB on the 970 I thought was pretty crazy at $330 when it launched, but now AMD is forced to sell their custom 8GB 290X for only around $350-360 and there's more recent rumors that Hawaii is going to be rebranded again for R9 300 desktop with a standard 8GB. How much are they going to ask for it is the question, because that's a lot of RAM to put on a card that sells for maybe $400 tops.
Wow lol. That 4GB rumor again. And that X2 rumor again. And $849 price tag for just the single GPU version???! I guess AMD is looking to be rewarded for their efforts with HBM and hitting that ultra-premium tier? I wonder if the market will respond at that asking price if the single-GPU card does only have 4GB.
Artificial numbers, like YZW FPS in games S, X and E, will ;)
Do note that Nvidia needs to pack in lots of GB just for the wide-bus effect! It works for them, but games do not require 12GB now, nor in the short-term future (no consoles!)
@robinspi: Looks like Ryan Shrout at PCPer all but confirms single-GPU Fiji will be limited to 4GB this round; Joe Macri at AMD was discussing it with him:
http://www.pcper.com/reviews/General-Tech/High-Ban... "Will gaming suffer on the high end with only 4GB? Macri doesn’t believe so; mainly because of a renewed interest in optimizing frame buffer utilization. Macri admitted that in the past very little effort was put into measuring and improving the utilization of the graphics memory system, calling it “exceedingly poor.” The solution was to just add more memory – it was easy to do and relatively cheap. With HBM that isn’t the case as there is a ceiling of what can be offered this generation. Macri told us that with just a couple of engineers it was easy to find ways to improve utilization and he believes that modern resolutions and gaming engines will not suffer at all from a 4GB graphics memory limit. It will require some finesse from the marketing folks at AMD though…"
Looks like certain folks who trashed the 980 at launch for having only 4GB are going to have a tough time respinning their stories to fit an $850 AMD part with only 4GB.....
How are you so sure it won't be $850? Stop getting all butthurt and maybe read the typical rumor sites that have gotten everything else to date correct? 4GB HBM, check. X2, check. Water cooled, check. And today, multiple sources from those sites are saying $850 and a new premium AMD GPU tier to try and compete with Titan.
That price doesn't make sense given the cost differences between GDDR5 and HBM, once you take into account some cost savings that offset a portion of the added HBM cost.
I'm guessing if they found a way to make an 8GB version, it would be $800-900, as that would eliminate the cost benefits of moving away from GDDR5 as far as I can tell.
A 4GB version I would expect to be $500-550 and $650-700 respectively. Well, to be honest, I personally think they will have three different core counts coming from Fiji, given the large gap in CUs from Hawaii to Fiji (given that it has 64 CUs, which everything still points towards).
Huh? Do you think HBM costs more than GDDR5 to implement, or not? There are minor savings on cheaper components/processes, like the PCB, but HBM could be 3-4-10x more expensive per GB; given historical new RAM pricing none of this is that far out there. We also know there's added complexity and cost with the interposer, and AMD is not putting expensive HBM on lower end parts, rebadges, or APUs. This all points to the fact the BoM is high and they are looking to be rewarded for their R&D.
In any case, keep hoping for an 8GB (single-GPU version), it seems pretty obvious the 4GB limits for HBM1 are true as AMD is now in full damage control mode saying 4GB is enough.
Heheh nah, always fun jabbing AMD fanboys like medi03 that I've gone back and forth with over the years, he's been really quiet lately, he may actually be disheartened by AMD's recent bad fortunes, which is uncommon for these die hard AMD fans!
No, not necessarily. AMD isn't exactly allaying any fears by remaining silent so far, but there's a method for chaining two HBM chips together, similar to how chip-select works in normal DDR RAM or SRAMs in embedded systems -- basically you have two chips sharing that 1024-bit memory bus, but there's a separate control signal that indicates which chip the host is talking to. In theory you can chain things along forever with enough chip selects, but a high-performance and highly-parallel bus like HBM is practically limited by signal-propagation latency and misalignment, so using just two chips per HBM bus is more of a practical limitation.
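If it helps, here's a toy illustration of the chip-select idea in general (this is how CS decoding works conceptually, not a claim about how HBM actually addresses stacks):

```python
# Toy model of chip-select decoding: two 1 GB devices share one data bus,
# and the top address bit decides which device's CS line gets asserted.
GB = 1 << 30

def decode(addr: int):
    chip_select = addr // GB   # 0 -> first chip, 1 -> second chip
    local_addr = addr % GB     # offset within the selected chip
    return chip_select, local_addr

print(decode(0x1000))        # (0, 4096) -> chip 0
print(decode(GB + 0x1000))   # (1, 4096) -> chip 1
```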
Nope, at least not according to my understanding. In fact, in theory, HBM1 can be configured, at reduced speeds, to well over 8GB. The article even mentions a technical bit of information pertaining to this:
"HBM in turn allows from 2 to 8 stacks to be used, with each stack carrying 1GB of DRAM."
From 2GB to 8GB right there, without any trickery. It appears HBM chips need to be used in pairs (otherwise a 2 chip minimum makes no sense), and likely needs to be addressed in pairs (with a 512-bit bus per chip, it would seem). This would indicate there is a two-bit address line which allows from one to four pairs to be individually addressed, or perhaps four binary address lines, whichever they deemed to be more economical and prudent. Either way it appears each stack has a 512-bit data bus.
If correct, you can even use a single 1024-bit bus and interleave on the bus and address 8GB @ 128GB/s maximum. A 2048-bit bus would limit at 16GB @ 256 GB/s, a 3072-bit bus could use 24GB @ 384GB/s, and a 4096-bit bus could use 32GB @ 512GB/s. Interleaving on the bus, though, would increase latency and decrease throughput.
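For reference, those numbers fall straight out of bus width times data rate, assuming HBM1's 1 Gbps per pin; the capacities are the hypothetical interleaved-pairs scheme above, not anything announced:

```python
# Back-of-envelope: peak bandwidth = bus width (bits) * data rate (Gbps) / 8.
# Assumes HBM1's 1 Gbps per pin; capacities follow the interleaved-pairs idea
# described above, which is purely speculative.
GBPS_PER_PIN = 1
for bus_bits, gb in [(1024, 8), (2048, 16), (3072, 24), (4096, 32)]:
    gbs = bus_bits * GBPS_PER_PIN / 8
    print(f"{bus_bits}-bit bus: {gbs:.0f} GB/s, up to {gb} GB")
```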
That said, no company, especially not AMD, would design and then bet big on a memory technology that limited them to 4GB without having a solution ready. Everything I mentioned that the HBM chips would be required to support are standard for memory chips made for the last many many decades and was probably included even in the first rough draft for the command protocol without anyone even thinking about it twice. That's just how it works.
It might even be possible to use a 512-bit bus and some latching circuitry to drive HBM. You might even be able to do this with good performance and high capacities without modifying the memory chips at all.
All sounds really good in theory, unfortunately none of the (substantial) source material from AMD/Hynix supports this, nor do the comments from the AMD VP Macri who seems more or less resigned to the fact AMD is going forward with 4GB for HBM1.
But in any case, hopefully you won't be too disappointed if it is only 4GB.
Your comment made me remember that the standard was submitted to JEDEC.
JESD235 pertains to HBM (v1). From it I was able to determine that if 8GB were to be supported using 1GB stacks, the command interface would have to be duplicated per chip, but the (much larger) data bus could be shared - with some important timing caveats, of course, though that is nothing new for memory controllers (in fact, that is most of what they do). Still, it is not necessarily something you'd want to do without having already had a working product using the memory technology... and certainly not something you'd bother implementing if you expected higher-capacity chips to be available in a year's time...
I finally see how HBM works internally (something that's been lacking from most "technical" articles), and I see why its external interface doesn't follow convention - it's basically an 8/16 bank "up to 8 channel" collection of DRAM chips. Each channel can be addressed separately with a 128-bit data bus and can support 32Gb (4GB) of DRAM.
So HBM uses the relevant addressing lines internally, if at all (vendor specific), and doesn't provide for such a mechanism externally.
From what I'm seeing, it would seem you can build HBM with any width you want, in increments of 128 bits. Of course, standards are designed to be flexible. That could mean lower-powered devices could use 256-bit HBM interfaces to save power... unless I'm totally missing something (which is quite likely, it isn't like reading a standards document is the same as reading a quick overview ;-)).
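Here's what per-stack bandwidth would look like at different channel counts, if reduced-channel configurations are really allowed (my reading of the spec, so grain of salt):

```python
# Per-stack bandwidth as a function of populated channels, assuming 128-bit
# channels at 1 Gbps per pin. Whether vendors actually ship reduced-channel
# configurations is another matter entirely.
BITS_PER_CHANNEL, GBPS_PER_PIN = 128, 1
for channels in (2, 4, 8):
    gbs = channels * BITS_PER_CHANNEL * GBPS_PER_PIN / 8
    print(f"{channels} channels ({channels * BITS_PER_CHANNEL}-bit): {gbs:.0f} GB/s")
# 2 channels (256-bit): 32 GB/s, 4 channels: 64 GB/s, 8 channels (full stack): 128 GB/s
```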
With high-bandwidth memory, depth is not as necessary. Of course, only the benchmarks will actually show us.
And of course DX11 will be useless for this product. HBM was designed to solve a problem! DX12 solves the CPU bottleneck; however, DX12 benchmarks show that performance scales up nicely to 20 MILLION+ draw calls per second with 6 CPU cores feeding the GPU. When the CPU has 8 cores, the performance flatlines and does not get any better.
AnandTech demonstrated this quite clearly a few weeks back. However, HBM will scale far beyond 6 cores as there is more throughput.
Of course, that would mean the 390X must be benched using DX12 benchmarks. But that is what they were designed for: Mantle and DX12.
HBM was designed to solve a problem that becomes apparent with DX12. DX11 does not support multithreaded and multicore gaming. DX12 enables ALL CPU cores to feed the GPU through asynchronous shader pipelines and asynchronous compute engines.
With DX12, GPU performance scales well to 6 CPU cores; beyond that, GPU draw call performance flatlines: a GPU bottleneck. HBM will solve this problem.
DX11 is such a crippling API that anyone even using it to make a decision regarding a $1000 GPU purchase will likely waste their money.
With DX12, benching the Radeon 390X with HBM will demonstrate 400-500% performance increases over DX11.
Do you want to know the facts before you spend your money? Then demand DX12 benchmarks!!
According to AMD's Joe Macri, GDDR5-fed GPUs actually have too much unused memory today. To increase GPU memory bandwidth, wider memory interfaces are used, and because wider memory interfaces require a larger number of GDDR5 memory chips, GPUs ended up with more memory capacity than is actually needed. Macri also stated that AMD invested a lot into improving utilization of the frame buffer. This could include on-die memory compression techniques which are integrated into the GPU hardware itself, or more clever algorithms at the driver level.
Interesting. The article says that AMD is the only anticipated user of HBM1, but are there any rumors on where HBM2 might go?
Obvious thing is to make the stacks higher/denser (2-4GB per stack seems more suited to high-end 4K/VR gaming) and increasing the clocks on the interface.
Nvidia has already confirmed HBM2 support with Pascal (see the ref PCB on the last page). I guess they weighed the pros/cons of low supply/high costs and limited VRAM with HBM1 and decided to wait until the tech matured. HBM1 also has significantly less bandwidth than what HBM2 claims (1+ TB/s).
Probably part of it, but I suspect passing on HBM1 is part of the same more conservative engineering approach that's led to nVidia launching on new processes a bit later than ATI has over the last few generations. Going for the next big thing early on potentially gives a performance advantage, but it comes at a cost. Manufacturing is generally more expensive because early adopters end up having to fund more of the upfront expenses in building capacity, and being closer to the bleeding edge generally makes the engineering harder. A dollar spent fighting bleeding-edge problems is either going to contribute to higher device costs or to less engineering being available to optimize other parts of the design.
There's no right answer here. In some generations ATI got a decent boost from either a newer GDDR standard or GPU process. At other times, nVidia's gotten big wins from refining existing products; the 7xx/9xx series major performance/watt wins being the most recent example.
Idk, I think AMD's early moves have been pretty negligible. GDDR4 for example was a complete flop, made no impact on the market, Nvidia skipped it entirely and AMD moved off of it even in the same generation with the 4770. GDDR5 was obviously more important, and AMD did have an advantage with their experience with the 4770. Nvidia obviously took longer to get their memory controller fixed, but since then they've been able to extract higher performance from it.
And that's not even getting into AMD's proclivity to going to a leading edge process node sooner than Nvidia. Negligible performance benefit, certainly more efficiency (except when we are stuck on 28nm), but not much in the way of increased sales, profits, margins etc.
They probably also didn't have the engineering set up for it. *rolls eyes* For all of Nvidia's software superiority in the majority of cases, it is commonly accepted that AMD has far better physical design.
And, they also co-developed HBM. That probably doesn't hurt!
Nvidia probably wouldn't have gone with it anyways, but, I don't think they even had the option.
No, the article covers it quite well: AMD tends to move to next-gen commodity processes as soon as possible in an attempt to generate competitive advantage, but unfortunately for them, this gamble seldom pays off and typically increases their risk and exposure without any significant payoff. This is just another example, as HBM1 clearly has limitations and trade-offs related to capacity, cost and supply.
As for not having the option lol, yeah I am sure SK Hynix developed the process to pander it to only AMD and their measly $300M/quarter in GPU revenue.
Next-gen process? What does that have to do with HBM, again? There you lose me, even with that slight explanation.
Now, HBM has issues, but supply isn't one of them. Capacity (whether AMD really can make an 8GB card, or a 6GB card, which would be enough, really) is the real issue. Cost is a lesser one; it can be partially offset, so the extra cost of HBM won't all be extra cost eaten by AMD or added to the card. However, the cost will be higher than if the card had 4GB of GDDR5.
AMD *worked with* SK Hynix to develop this technology. This technology is going to be widely adopted. At least, SK Hynix believed that enough to be willing to push forward with it while only having AMD as a partner (it appears to me). There's obviously some merit with it.
How can you say HBM doesn't have supply/yield issues? You really can't say that, in fact, if it follows the rest of the DRAM industry's historical pricing, prices are going to be exponentially higher until they ramp for the mainstream.
This article already lists out a number of additional costs that HBM carries, including the interposer itself which adds complexity, cost and another point of failure to a fledgling process.
Lol @ Chizowshill doing what he does best, Nvidia troll carrot still visibly protruding, stenching out the Anandtech forums...thanks for the smiles dude.
Possibly; it would make sense, and it would also explain why they are still going forward with it even if the 1st iteration isn't exactly optimal due to the limitations covered here (4GB, increased costs, etc.).
HBM2 is supposed to double the bandwidth and density, so 8GB of RAM and 1TB/sec for a 4-chip setup. It also seems to allow up to 32GB, but HBM2 isn't supposed to be ready till Q2 2016.
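Quick math on where those figures come from, treating the doubling as a rumor rather than a confirmed spec:

```python
# Rumored HBM2 scaling vs. HBM1, per stack, times a 4-stack Fiji-style layout.
# These are the figures floating around the thread, not an official spec sheet.
stacks = 4
hbm1_gbs, hbm1_gb = 128, 1                      # HBM1: 128 GB/s and 1 GB per stack
hbm2_gbs, hbm2_gb = 2 * hbm1_gbs, 2 * hbm1_gb   # doubled bandwidth and density
print(stacks * hbm2_gbs, "GB/s total")          # 1024 GB/s, i.e. ~1 TB/s
print(stacks * hbm2_gb, "GB total")             # 8 GB
print(stacks * 8, "GB max")                     # 32 GB if stacks eventually reach 8 GB each
```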
Which is fine as the big 16/14 nm FinFET next generation chips aren't due to till around then anyway. The memory technology and foundry plans are aligning rather well.
How does temperature affect TSV's and the silicon interposer? Continuous thermal cycling usually stresses out joints. Wouldn't want one of the many thousand joints to break.
If I understand it correctly, joints usually suffer from thermal cycling because they are between different materials that heat and cool at different rates. The TSVs will be connecting silicon to silicon, so presumably the heating and cooling will be uniform and not stress the joints in that way.
Nice article Ryan, I think this gets back to some of the general tech deep dives that a lot of people miss on AT, rather than the obligatory item reviews that I know you guys have to put out as well. Always interesting to read about new and upcoming technology, thanks for the read!
This part on the last page, however, I think needs to be clarified, as it is REALLY important to stay consistent in terminology now that GPU socket and PCB topology is changing:
"By AMD’s own estimate, a single HBM-equipped GPU package would be less than 70mm X 70mm (4900mm2), versus 110mm X 90mm (9900mm2) for R9 290X."
Even by AMD's own slide, that is *PCB area* occupied by either the HBM GPU package, or the GPU + GDDR5 modules. Calling everything a "package" doesn't really fit here and just confuses the issue if we keep the term Package intact, meaning GPU substrate sitting on PCB.
No, package is the correct term, as it is a single complete item that attaches to the PCB, including the GPU, RAM, interposer, etc all in one piece. It is not much different than the MCM (Multi Chip Modules) that many manufacturers (Intel for example) have used in the past. Since the memory is all on the package, the PCB area used is the same size as the package itself, in this case.
I agree the HBM package terminology is correct, but I'm not referring to that. I'm referring to the reference that the 290X package size is 110mm x 90mm for R9 290X. That's not very clear, because they are counting the *PCB AREA* on the 290X and using it synonymously with Package.
It would be more clearly stated if it read something like:
"By AMD’s own estimate, the PCB area occupied by a single HBM-equipped GPU package would be less than 70mm X 70mm (4900mm2), versus 110mm X 90mm (9900mm2) PCB area for R9 290X that includes the GPU package and GDDR5 modules."
How much does the production of the interposer cost? It's obviously going to eat into AMD's margins, which would imply that unless they sell more product, their profits will actually decline. Likewise, I wonder if that extra cost is going to squeeze them on the low end, where they currently have an advantage.
I doubt gen 1 HBM will show up on budget cards; and wouldn't hold my breath on gen 2 or 3 either. For the 4xx generation, they're only putting it on the 490 family. 460-480 are going to remain at GDDR5. HBM will presumably kill off GDDR5 for midlevel cards over the next few years; but unless it becomes as cheap as DDR4 it's not going to be a factor on budget GPUs.
It'd actually make the board design simpler, as the difficult part, the memory traces, are now all in the interposer. The challenge for dual-GPU designs shifts toward power and cooling.
With the board area savings, they could conceptually do a triple GPU card. The problem wouldn't be the designs of such a card but actually getting enough power. Of course they could go out of the PCIe spec and go towards a 525W design for such a triple GPU beast.
The wonderful thing about having all the GPUs on a single board is that incorporating a private high speed bus between chips becomes possible to improve scaling. AMD attempted this before with the 4870X2: http://www.anandtech.com/show/2584/3
However, it was never really utilized as it was disabled via drivers.
Alternatively, multiple GPU dies and memory could just be placed onto the same interposer. Having a fast and wide bus between GPU dies would then become trivial. Power consumption, and more importantly power density, would not be so trivial.
This opens up a lot of possibilities. AMD could produce a CPU with a huge amount of on-package cache like Intel's crystalwell, but higher density.
For now it reinforces my opinion that the 10nm-class GPUs that are coming down the pipe in the next 12-16months are the ones that will really blow away the current generation. The 390 might match the Titan in gaming performance (when not memory-constrained) but it's not going to blow everything away. It will be comparable to what the 290x did to the 7970, just knock it down a peg instead of demolishing it.
CPUs demand low latency memory access, GPUs can hide the latency and require high bandwidth. Although I haven't seen anything specifically saying it, it seems to me that HMC is probably lower latency than HBM, and HBM may not be suitable for that system.
I'll agree that CPUs need a low latency path to memory.
Where I'll differ is on HMC. That technology does some serial-to-parallel conversion which adds latency into the design. I'd actually fathom that HBM would be the one with the lower latency.
I don't see this doing very well at all for AMD. Yields are said to be low, and this is 1st gen, so it's bound to have issues. Compound that with the fact we haven't seen an OFFICIAL driver in 10 months, and poorly performing games at release... I want to back the little guy, but I just can't; there are far too many risks. Plus the 980 will come down in price when this comes out, making it a great deal, and coupled with 1-3 months for new drivers from NV, you have yourself a much better platform. And as others mentioned, HBM2 will be where it's at, as 4GB on a 4K-capable card is pretty bad. So there's no point buying this card if 4GB is the max; it's a waste. 1440p fills 4GB. So it looks like AMD will still be selling 290X rebrands and doing poorly on the flagship product, and Nvidia will continue to dominate discrete graphics. The 6GB 980 Ti is looking like a sweet option till Pascal. And for me, my R9 280 just doesn't have the horsepower for my new 32" Samsung 1440p monitor, so I have to get something.
I'd expect that the reason for the long wait in drivers is getting the new generation's drivers ready. Also, what settings does 1440P fill 4 GB on? I don't see 980 SLI or the 295X tanking in performance, as they would if their memory was getting maxed.
DriverVer=11/20/2014, 14.501.1003.0000, with the catalyst package being 12/8/2014 - That would be 6mo. You're more than welcome to feel 6mo is too long, but it's not 10.
Typo page 2: "Tahiti was pushing things with its 512-bit GDDR5 memory bus." That should be Hawaii with a 512 bit wide bus or Tahiti with a 384 bit wide bus.
"First part of the solution to that in turn was to develop something capable of greater density routing, and that something was the silicon interposer. " "Moving on, the other major technological breakthrough here is the creation of through-silicon vias (TSVs). "
You guys are acting like interposers and TSVs were created by AMD and Hynix for this; it's hugely misleading the way you chose to phrase things. And ofc, as always when you do this kind of article (Aptina, Synaptics, Logitech and a few more in the last few years), it's more advertising than anything else. You don't talk about other similar technologies, existing or potential; you just glorify the one you are presenting.
This isn't an article on HBM itself but on AMD's next-gen cards. They are focusing on AMD because of that fact. If this were about HBM itself, I'm sure they would talk about other technologies out there as well. Don't criticize them for staying on topic in the article.
A side note for the article: ATI was also the main developer of GDDR3, with JEDEC helping a little. Nvidia launched with it first, but ATI __DID__ most of the design work.
Having finished the article, I was also under the impression that high-clock GDDR5 used 2-2.5 watts per chip on the board. I don't see why 7Gbps GDDR5 with 50% more chips would use only 5% more power (currently on the graph, the 290 = 16 chips @ 5Gbps, ~30W; Titan X = 24 chips @ 7Gbps, ~31.5W).
Given AMD's ~15-20% for the 290x, I would put that at around 35-50W, while NVidia's solution is at least 50W. Of course, I could be wrong!
As a note, I get that you used the best GDDR5 bandwidth/W figure you could get... However, that's likely at the best point in the perf/watt curve. I suspect that's under 5Gbps, based on AMD's claimed GDDR5 consumption on the 290(X) and their memory clock.
Which would put AMD's figure under that number, and Nvidia's further under that number.
They're rough estimates based on power consumption per bit of bandwidth and should be taken as such. Titan X has more chips, but it doesn't have to drive a wider memory bus.
So, should I assume that GDDR5 chips don't use power if you don't make a wider bus? And that 7Gbps is the best bandwidth/watt point for GDDR5? Or that GDDR5 power consumption doesn't change when you raise or lower the clock speed?
Nvidia's generalized power is just easier to calculate because they use 7Gbps. Anyhow, my guesstimate for the 290X is that it would use 32W given perfect power scaling from 5Gbps to 7Gbps, and that it has fewer chips to run voltage to.
The reality is probably AMD's is 40-50W and NVidia is 50-60W. Running more GDDR5 chips at higher clockspeeds, even on a smaller bus, should result in higher power usage.
I have rose tinted glasses, I also do have a brain.
It's quite the role-reversal, really. Back in the GT 200 days, NVIDIA were throwing out cards with wider memory buses, and AMD showed them that it was (mostly) unnecessary.
Whichever solution uses the most power for GDDR5 stands to gain the most with a move to HBM. I'd be interested in seeing how much juice the memory on a 12GB card uses...
Nvidia didn't really have a choice; GDDR5 was *barely* ready for the 4870, iirc. Nvidia would have had to hold back finished cards for months to be able to get GDDR5 on them. Actually, they would have had to bet on whether GDDR5 would even be ready for production at that point.
It isn't as simple as flipping a switch and having the GDDR5 controller work for GDDR3. It would require additional parts, leading to fewer dies per wafer and lower yields.
Nvidia did what was required to ensure their part would get to market ASAP with enough memory bandwidth to drive its shaders.
An exceptional amount of energy is spent on the bus and host controller, which is why GDDR power consumption is such a growing issue. At any rate, yes, more chips will result in increased power, but we don't have a more accurate estimation at this time. The primary point is that the theoretical HBM configuration will draw half the power (or less) of the GDDR5 configurations.
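For anyone curious how a rough per-bit estimate like that works out, here's the kind of arithmetic involved; the GB/s-per-watt efficiency figures are the approximate ones from AMD's HBM slides as I remember them, so treat the outputs as ballpark only:

```python
# Rough memory-power estimate from bandwidth-per-watt efficiency figures.
# The ~10.7 and ~35 GB/s-per-watt numbers are approximations of AMD's slide
# figures; real boards will differ.
def mem_power_w(bandwidth_gbs: float, gbs_per_watt: float) -> float:
    return bandwidth_gbs / gbs_per_watt

print(round(mem_power_w(320, 10.7), 1))  # ~29.9 W: 290X-class GDDR5 at 320 GB/s
print(round(mem_power_w(336, 10.7), 1))  # ~31.4 W: Titan X-class GDDR5 at 336 GB/s
print(round(mem_power_w(512, 35.0), 1))  # ~14.6 W: hypothetical 512 GB/s HBM setup
```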
Honestly I'm most interested in seeing what this is going to do for card sizes. As the decreased footprints for a GPU+HBM stack in AMD's planning numbers or nVidia's Pascal prototype show, there's potential for significantly smaller cards in the future.
Water cooling enthusiasts look like big potential winners; a universal block would cover the ram too instead of just the GPU, and full coverage blocks could be significantly cheaper to manufacture.
I'm not so sure about the situation for air-cooled cards though. Blower designs shouldn't be affected much, but no one really likes those. Open-air designs look like they're more at risk. If you shorten the card significantly you end up with only room for two fans on the heatsink instead of three, meaning you'd either have to accept reduced cooling or higher and louder fan speeds. That, or have the cooler significantly overhang the PCB, I suppose. Actually, that has me wondering how, or if, being able to blow air directly through the heatsink instead of in the top and out the sides would impact cooling.
The open-air problem is only a problem if there is a new, smaller form factor for the video cards. OEM partners will likely make oversized heatsinks, or use a custom PCB to support more fans, just as they do now. With the reduced power envelope, I imagine the bulk of the OEM's will use the savings to make more compact designs, rather than use the energy savings to make higher performing designs.
I'm a bit skeptical that HBM will dramatically increase the performance of the GPU. While it's true that this will help with high-resolution rendering, there's also the fact that if the GPU wasn't up to snuff to begin with, it doesn't matter how much memory bandwidth you throw at it. But I'm willing to wait and see when this tech finally shows up on our store shelves before committing to any idea.
If anything, I'm only led to believe this will just solve memory bandwidth and power consumption issues for a while.
It won't on 28nm. It will give you higher core clocks within a TDP, yes. HBM currently shows bandwidth scaling to at least 8TB/s from what I can tell... which is over 20 times the Titan X currently. Even if they can "only" hit half of that, it should supply more than enough bandwidth until a 5nm process at least.
I agree, it is interesting though regardless, as 2.5D stacked RAM is clearly going to be the future of GPU memory, which will in turn drive different form factors, cooling solutions etc.
Great article. Sounds like custom water coolers may be shut out, because the OEM cooler will be water-cooled and there probably isn't enough improvement left to justify custom cooling.
I'm anxious to see the performance of a single 390x Fiji vs my 2 custom cooled R9 290s in CF.
This is fantastic. I mean, we cannot build any wider, so it is neat to see them finding ways to build upwards instead.
I would love to see a next gen device that pairs a card like this with HMC. Super fast mixed-use storage/memory combined with super fast GPU memory would make for a truly amazing combination.
Also, I don't see the 4GB limit being a big deal for mainstream high-end cards or laptops. It is only the ultra high-end enthusiast cards that might suffer in sales.
nVidia also simply has fewer good, long-term relationships to exploit than AMD has. The whole semi-conductor industry has been working with AMD for 45 years, whereas nVidia, run by Jen Hsun Huang, a former AMD employee, has only been around for about 15 years.
Name long term good relationships that Nvidia has had with other companies in the industry. Besides their board partners. You could argue TSMC either way. Otherwise, I'm getting nothing. They recently have a relationship with IBM that could become long term. It is entirely possible I'm just missing the companies they partner with that are happy with their partnership in the semi-conductor industry.
This analysis does not discuss the benefits of the base die. The base die contains the memory controller and a data serializer. Moving the memory controller to the base die simplifies the design and removes many bottlenecks. The base die is large enough to support a large number of circuits (#1 memory controller, #2 cache, #3 data processing). 4096 wires is a large number, and 4096 I/O buffers is a large number. The area of 4096 I/O buffers on the GPU die is expensive, and this expense is easily avoided by placing the memory controller on the base die. The 70% memory bus efficiency means idle bandwidth, and that idle data does not need to be sent back to the GPU. The 4096 interposer signals reduce to (4096 * 0.7 = 2867), saving 1,229 wires + I/O buffers.
A simple 2-to-1 serializer would reduce that down further (2867 * 0.5 ≈ 1434). The interposer wires are short enough to avoid termination resistors for a 2GHz signal. Removing the termination resistors is at the top of the list for saving power; second on the list is minimizing row activates.
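Putting that wire-count arithmetic in one place (these are the 70% utilization and 2:1 serialization assumptions from above, not anything AMD has published):

```python
# Wire-count reduction under the assumptions above: ~70% useful bus utilization
# plus a 2:1 serializer on the base die. Purely speculative; AMD has not
# described such a design.
full_bus = 4096
after_utilization = round(full_bus * 0.7)         # ~2867 wires carrying useful data
after_serializer = round(after_utilization / 2)   # ~1434 wires with 2:1 serialization
print(full_bus - after_utilization, "wires/IO buffers saved by dropping idle bandwidth")
print(after_serializer, "wires left after 2:1 serialization")
```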
So am I correct in assuming, then, that the 295X2-equivalent performance numbers for Fiji leaked months ago are for the dual-GPU variant? It concerns me that at no point in this write-up did AMD even speculate what the performance increase with HBM might be.
Why is everyone concerned about the 4GB limit in VRAM? A few enthusiasts might be disappointed, but for anyone who isn't using multiple 4k monitors, 4GB is just fine. It might also be limiting in some HPC workloads, but why would any of us consumers care about that?
I guess the concern is that people were expecting AMD's next flagship to pick up where they left off on the high-end, and given how much AMD has touted 4K, that would be a key consideration. Also, there are the rumors this HBM part is $850 to create a new AMD super high-end, so yeah, if you're going to say 4K is off the table and try to sell this as a super premium 4K part, you're going to have a hard sell as that's just a really incongruent message.
In any case, AMD says they can just driver-magic this away, which is a recurring theme for AMD, so we will see. HBM's main benefit is VRAM-to-GPU transfers, but anything that doesn't fit in the local VRAM is still going to need to come from system RAM or, worse, local storage. Textures for games are getting bigger than ever... so yeah, not a great situation to be stuck at 4GB for anything over 1080p, imo.
Do the R9 290/290X really perform that much better with overclocked memory on the cards? I didn't think AMD was ever really constrained by bandwidth, as they usually had more of it on their generation of cards. Consequently, I don't see the 390/390X being that much competition for the Titan X.
You have done an excellent job of displaying your level of intelligence. I don't think the New York Giants will provide much competition to the rest of the NFL this year. I won't support my prediction with any facts or theories just wanted to demonstrate that I am not a fan of the Giants.
It's only developed by them. It's a technology that is on the market now (or will be in 6 months, after it stops being AMD-exclusive). It's the same with GDDR3/5: ATI did lots of the work developing them, but NV still had the option of using them.
Like any standards board or working group, you have a few heavy-lifters and everyone else leeches/contributes as they see fit, but all members have access to the technology in the hopes it drives adoption for the entire industry. Obviously the ones who do the most heavy-lifting are going to be the most eager to implement it. See: FreeSync and now HBM.
I do not agree with this article saying GPUs are memory-bandwidth bottlenecked. If you don't believe me, test it yourself: keep the GPU core clock at stock and maximize your memory OC and see the very little, if any, gains. Now put the memory at stock and maximize your GPU core OC and see the noticeable, decent gains.
HBM is still a very necessary step in the right direction. Being able to dedicate an extra 25-30 watts to the GPU core power budget is always a good thing. As 4K becomes the new standard and games upgrade their assets to take advantage of it, we should start to see GDDR5's bandwidth eclipsed, especially with multi-monitor 4K setups. It's better to be ahead of the curve than playing catch-up, but the benefits you get from using HBM right now, today, are actually pretty minor.
In some ways it hurts AMD, as it forces us to pay more money for a feature we won't get much use out of. Would you rather pay $850 for an HBM 390X or $700 for a GDDR5 390X with basically identical performance, since memory bandwidth is still good enough for the next few years with GDDR5?
I agree, bandwidth is not going to be the game-changer that many seem to think, at least not for gaming/graphics. For compute, bandwidth to the GPU is much more important as applications are constantly reading/writing new data. For graphics, the main thing you are looking at is reducing valleys and any associated stutters or drops in framerate as new textures are accessed by the GPU.
High Bandwidth is absolutely essential for the increased demand that DX12 is going to provide. With DX11 GPU's did not work very hard. Massive drawcalls are going to require massive rendering. That is where HBM is the only solution.
With DX11 the API overhead for a dGPU meant around 2 MILLION draw calls. With DX12 that changes radically to 15-20 MILLION draw calls. All those extra polygons need rendering! How do you propose to do that with minuscule DDR4-5 pipes?
Just a note: the HBM solution seems to be more effective for high-memory-bandwidth loads. For low loads, the slower memory with higher parallelism might not be effective against the faster GDDR5.
I understand that the article is primarily focussed on AMD as the innovator and GPU as the platform because of that. But once this is an open tech, and given the aggressive power budgeting now standard practice in motherboard/CPU/system design, won't there come a point at which the halving of power required means this *must* challenge standard CPU memory as well?
I just feel I'm missing here a roadmap (or even a single sidenote, really) about how this will play into the non-GPU memory market. If bandwidth and power are both so much better than standard memory, and assuming there isn't some other exotic game-changing technology in the wings (RRAM?) what is the timescale for switchover generally? Or is HBM's focus on bandwidth rather than pure speed the limiting factor for use with CPUs? But then, Intel forced us on to DDR4 which hasn't much improved speeds while increasing cost dramatically because of the lower operating voltage and therefore power efficiency... so there's definitely form in that transitioning to lower power memory solutions. Or is GDDR that much more power-hungry than standard DDR that the power saving won't materialise with CPU memory?
The non-GPU memory market is best described as TBD.
For APUs it makes a ton of sense, again due to the GPU component. But for pure CPUs? The cost/benefit ratio isn't nearly as high. CPUs aren't nearly as bandwidth starved, thanks in part to some very well engineered caches.
There's something that concerns me with this: Heat!
They push the benefits of a more compact card, but that also moves all the heat from the RAM right up next to the main core. The stacking factor of the RAM also scrunches their heat together, making it harder to dissipate.
The significant power reduction results in a significant heat reduction, but it still concerns me. Current coolers are designed to cover the RAM for a reason, and the GPUs currently get hot as hell. Will they be able to cool this combined setup reasonably?
I see you missed the part of the article that discusses how the entire package, RAM and GPU cores, will be covered by a heat spreader (most likely along with some heat transferring goo underneath to make everything level) that will make it easier to dissipate heat from all the chips together.
Similar to how Intel CPU packages (where there's multiple chips) used heat spreaders in the past.
You will wait 1 year for the second generation. The second-generation chips will be a big improvement over the current chips (Pascal = 8x Maxwell; R400 = 4x R390).
HBM doesn't need the depth of memory that DDR4 or GDDR5 does. DX12 performance is going to go through the roof.
HBM was designed to solve the GPU bottleneck. The electrical path latency improvement is at least one clock, not to mention the width of the pipes. The latency improvement will likely be 50% better. In and out. HBM will outperform ANYTHING that nVidia has out using DX12.
Use DX11 and you cripple your GPU anyway; you can only get 10% of the DX12 performance out of your system.
So get Windows 10, enable DX12, and buy this card; by Christmas ALL games coming out will be DX12-capable, as Microsoft is supporting DX12 on the XBOX.
What if you put the memory controllers and the ROPs on the memory stack's base layer? You'd save more area for the GPU and have less data traffic going from the memory to the GPU.
Interposer having embedded memory controller circuitry US 20140089609 A1 " For high-performance computing systems, it is desirable for the processor and memory modules to be located within close proximity for faster communication (high bandwidth). Packaging chips in closer proximity not only improves performance, but can also reduce the energy expended when communicating between the processor and memory. It would be desirable to utilize the large amount of "empty" silicon that is available in an interposer. "
And a little light reading:
“NoC Architectures for Silicon Interposer Systems Why pay for more wires when you can get them (from your interposer) for free?” Natalie Enright Jerger, Ajaykumar Kannan, Zimo Li Edward S. Rogers Department of Electrical and Computer Engineering University of Toronto Gabriel H. Loh AMD Research Advanced Micro Devices, Inc” http://www.eecg.toronto.edu/~enright/micro14-inter...
It's only an advantage if your product actually WINS the gaming benchmarks. Until then, there is NO advantage. And I'm not talking about winning benchmarks that are purely used to show off bandwidth (like 4K when running single-digit fps, etc.), when games are still well below 30fps minimum anyway. That is USELESS. If the game isn't playable at whatever settings you're benchmarking, that is NOT a victory. It's like saying "well, if we magically had a GPU today that COULD run 30fps at this massively stupid resolution for today's cards, company X would win"... LOL. You need to win at resolutions 95% of us are using and in games we are playing (based on sales, hopefully, in most cases).
Nvidia's response to anything good that comes of rev-1 HBM will be: we have more memory (the perception after years of built-up "more mem = better"), plus adding up to a 512-bit bus (from the current 384-bit on their cards) if memory bandwidth is any kind of issue for next-gen top cards. Yields are great on GDDR5, speeds can go up as shrinks occur, and as noted Nvidia is on 384-bit, leaving a lot of room for even more memory bandwidth if desired. AMD should have gone one more rev (at least) on GDDR5 to be cost-competitive, as more memory bandwidth (when GCN 1.2+ brings more bandwidth anyway) won't gain enough to make a difference versus the price increase it will cause. They already have zero pricing power on APUs, CPUs and GPUs. This will make it worse for a brand new GPU.
What they needed to do was chop off the compute crap that most don't use (saving die size, or committing that size to more stuff we DO use in games) and improve drivers. Their last WHQL drivers were December 2014 (for APU or GPU; I know, I'm running them!). The latest beta on their site is 4/12. Are they so broke they can't even afford a new WHQL driver every 6 months?
No day-1 drivers for Witcher 3. Instead we get complaints that GameWorks hair cheats AMD and complaints that CDPR rejected TressFX two months ago. Ummm, they should have asked for that the day they saw Witcher 3 wolves with GameWorks hair 2 years ago, not 8 weeks before the game launched. Nvidia spent 2 years working with them on getting the hair right (and it only works great on the latest cards; it screws Kepler too until possible driver fixes even for those), while AMD made the call to CDPR 2 months ago... LOL. How did they think that call would go this late into development, which is basically just spit and polish in the last months? Hairworks in this game is clearly optimized for Maxwell at the moment but should improve over time for others. Turn down the anti-aliasing on the hair if you really want a temp fix (edit the config file). I think the game was kind of unfinished at release, TBH. There are lots of issues looking around (not just Hairworks). Either way, AMD clearly needs to do the WORK necessary to keep up, instead of complaining about the other guy.
If AMD can put 4-8 gigs of HBM on a GPU, then they can do the same with CPUs as well as APUs. All of the patents that I am showing below reference 3D stacked memory.
In fact one interesting quote from the patents listed below is this:
"Packaging chips in closer proximity not only improves performance, but can also reduce the energy expended when communicating between the processor and memory. It would be desirable to utilize the large amount of "empty" silicon that is available in an interposer."
AMD has plans to fill that empty silicon with much more memory.
The point: REPLACE SYSTEM DYNAMIC RAM WITH ON-DIE HBM 2 OR 3!
Eliminating the electrical path distance to a few millimeters from 4-8 centimeters would be worth a couple of clocks of latency. If AMD is building HBM and HBM 2 then they are also building HBM 3 or more!
Imagine what 64GB of HBM could do for a massive server die such as Zen. The energy savings alone would be worth it, never mind the hugely reduced motherboard size and the elimination of sockets and RAM packaging. The increased number of CPUs per blade or mobo also reduces costs, as servers can become much more dense.
Most folks now only run 4-8 gigs in their laptops or desktops. Eliminating DRAM and replacing it with HBM is a huge energy and mechanical savings as well as a staggering performance jump, and it destroys DDR4. That process will be very mature in a year and costs will drop. Right now the retail cost of DRAM per GB is about $10. Subtract packaging and channel costs and that drops to $5 or less. Adding 4-8 GB of HBM has a very cheap material cost; likely the main expense is the process, testing and yields. Balance that against the energy savings and mobo real-estate savings, and HBM replacing system DRAM becomes even more likely, even without the massive leap in performance as an added benefit.
The physical cost savings are quite likely equivalent to the added process cost, since Fiji will likely be released at a very competitive price point.
AMD is planning on replacing system DRAM with stacked HBM. Here are the patents. They were all published last year and this year with the same inventor, Gabriel H. Loh, and the assignee is of course AMD.
Stacked memory device with metadata management WO 2014025676 A1 "Memory bandwidth and latency are significant performance bottlenecks in many processing systems. These performance factors may be improved to a degree through the use of stacked, or three-dimensional (3D), memory, which provides increased bandwidth and reduced intra-device latency through the use of through-silicon vias (TSVs) to interconnect multiple stacked layers of memory. However, system memory and other large-scale memory typically are implemented as separate from the other components of the system. A system implementing 3D stacked memory therefore can continue to be bandwidth-limited due to the bandwidth of the interconnect connecting the 3D stacked memory to the other components and latency-limited due to the propagation delay of the signaling traversing the relatively-long interconnect and the handshaking process needed to conduct such signaling. The inter-device bandwidth and inter-device latency have a particular impact on processing efficiency and power consumption of the system when a performed task requires multiple accesses to the 3D stacked memory as each access requires a back-and-forth communication between the 3D stacked memory and thus the inter-device bandwidth and latency penalties are incurred twice for each access."
Interposer having embedded memory controller circuitry US 20140089609 A1 " For high-performance computing systems, it is desirable for the processor and memory modules to be located within close proximity for faster communication (high bandwidth). Packaging chips in closer proximity not only improves performance, but can also reduce the energy expended when communicating between the processor and memory. It would be desirable to utilize the large amount of "empty" silicon that is available in an interposer. "
Die-stacked memory device with reconfigurable logic US 8922243 B2 "Memory system performance enhancements conventionally are implemented in hard-coded silicon in system components separate from the memory, such as in processor dies and chipset dies. This hard-coded approach limits system flexibility as the implementation of additional or different memory performance features requires redesigning the logic, which design costs and production costs, as well as limits the broad mass-market appeal of the resulting component. Some system designers attempt to introduce flexibility into processing systems by incorporating a separate reconfigurable chip (e.g., a commercially-available FPGA) in the system design. However, this approach increases the cost, complexity, and size of the system as the system-level design must accommodate for the additional chip. Moreover, this approach relies on the board-level or system-level links to the memory, and thus the separate reconfigurable chip's access to the memory may be limited by the bandwidth available on these links."
Hybrid cache US 20140181387 A1 "Die-stacking technology enables multiple layers of Dynamic Random Access Memory (DRAM) to be integrated with single or multicore processors. Die-stacking technologies provide a way to tightly integrate multiple disparate silicon die with high-bandwidth, low-latency interconnects. The implementation could involve vertical stacking as illustrated in FIG. 1A, in which a plurality of DRAM layers 100 are stacked above a multicore processor 102. Alternately, as illustrated in FIG. 1B, a horizontal stacking of the DRAM 100 and the processor 102 can be achieved on an interposer 104. In either case the processor 102 (or each core thereof) is provided with a high bandwidth, low-latency path to the stacked memory 100. Computer systems typically include a processing unit, a main memory and one or more cache memories. A cache memory is a high-speed memory that acts as a buffer between the processor and the main memory. Although smaller than the main memory, the cache memory typically has appreciably faster access time than the main memory. Memory subsystem performance can be increased by storing the most commonly used data in smaller but faster cache memories."
Partitionable data bus US 20150026511 A1 "Die-stacked memory devices can be combined with one or more processing units (e.g., Central Processing Units (CPUs), Graphics Processing Units (GPUs), and Accelerated Processing Units (APUs)) in the same electronics package. A characteristic of this type of package is that it can include, for example, over 1000 data connections (e.g., pins) between the one or more processing units and the die-stacked memory device. This high number of data connections is significantly greater than data connections associated with off-chip memory devices, which typically have 32 or 64 data connections."
Non-uniform memory-aware cache management US 20120311269 A1 "Computer systems may include different instances and/or kinds of main memory storage with different performance characteristics. For example, a given microprocessor may be able to access memory that is integrated directly on top of the processor (e.g., 3D stacked memory integration), interposer-based integrated memory, multi-chip module (MCM) memory, conventional main memory on a motherboard, and/or other types of memory. In different systems, such system memories may be connected directly to a processing chip, associated with other chips in a multi-socket system, and/or coupled to the processor in other configurations. Because different memories may be implemented with different technologies and/or in different places in the system, a given processor may experience different performance characteristics (e.g., latency, bandwidth, power consumption, etc.) when accessing different memories. For example, a processor may be able to access a portion of memory that is integrated onto that processor using stacked dynamic random access memory (DRAM) technology with less latency and/or more bandwidth than it may a different portion of memory that is located off-chip (e.g., on the motherboard). As used herein, a performance characteristic refers to any observable performance measure of executing a memory access operation."
“NoC Architectures for Silicon Interposer Systems: Why pay for more wires when you can get them (from your interposer) for free?” Natalie Enright Jerger, Ajaykumar Kannan, Zimo Li (Edward S. Rogers Department of Electrical and Computer Engineering, University of Toronto); Gabriel H. Loh (AMD Research, Advanced Micro Devices, Inc.) http://www.eecg.toronto.edu/~enright/micro14-inter...
“Efficiently Enabling Conventional Block Sizes for Very Large Die-stacked DRAM Caches” Gabriel H. Loh (AMD Research, Advanced Micro Devices, Inc.); Mark D. Hill (Department of Computer Sciences, University of Wisconsin – Madison) http://research.cs.wisc.edu/multifacet/papers/micr...
All of this adds up to HBM being placed on-die as a replacement for, or maybe a supplement to, system memory. But why have system DRAM if you can build much wider-bandwidth memory closer to the CPU, on-die? Unless of course you build socketed HBM DRAM and a completely new system memory bus to feed it.
Replacing system DRAM with on-die HBM has the same benefits for the performance and energy demand of the system as it has for GPUs. It also makes for smaller motherboards, no memory sockets and no memory packaging.
Of course this is all speculation. But it also makes sense.
With HBM in mind: does AMD hold the patent for this? Is Nvidia just going to use HBM for free? Anyone care to elaborate? Because if Nvidia gets to use it for free, that's really funny for AMD's side, considering they are the ones who researched and developed it. Am I making sense?
HighTech4US - Tuesday, May 19, 2015 - link
So Fiji is really limited to 4 GB VRAM.
chizow - Tuesday, May 19, 2015 - link
Wow, yep, although Ryan leaves the door open in the article, it is clear HBM1 is limited to 1GB per stack with 4 stacks on the sample PCBs. How AMD negotiates this will be interesting.Honestly it is looking more and more like Fiji is indeed Tonga XT x2 with HBM. Remember all those rumors last year when Tonga launched that it would be the launch vehicle for HBM? I guess it does support HBM, it just wasn't ready yet. Would also make sense as we have yet to see a fully-enabled Tonga ASIC; even though the Apple M295X has the full complement of 2048 SP, it doesn't have all memory controllers.
Kevin G - Tuesday, May 19, 2015 - link
The 1024 bit wide bus of an HBM stack is composed of eight 128 bit wide channels. Perhaps only half of the channels need to be populated allowing for twice the number of stacks to reach 8 GB without changing the Fiji chip itself?akamateau - Thursday, May 28, 2015 - link
Electrical path latency is cut to ZERO. EP Latency is how many clocks are used moving the data over the length of the electrical path. That latecy is about a one clock.testbug00 - Tuesday, May 19, 2015 - link
M295X isn't only in Apple... I think Alienware has one to! XDYeah. Interesting, even Charlie points that out a lot. He also claims that developers laugh at needing over 4GB, which, may be true in some games... GTA V and also quite a few (very poor) games show otherwise.
Of course, how much you need to have ~60+ FPS, I don't know. I believe at 1440/1600p, GTA V at max doesn't get over 4GB. Dunno how lowering settings changes the VRAM there. So, to hit 60FPS at a higher res might require turning down the settings of CURRENT GAMES (not future games, those are the problem!) probably would fit *MOST* of them inside 4GB. I still highly doubt that GTA V and some others would fit, however. *grumble grumble*
Hope AMD is pulling wool over everyone's eyes, however, their presentation does indeed seem to limit it to 4GB.
xthetenth - Tuesday, May 19, 2015 - link
GTA V taking over 4 GB if available and GTA V needing over 4 GB are two very different things. If it needed that memory then 980 SLI and 290X CF/the 295X would choke and die. They don't.hansmuff - Tuesday, May 19, 2015 - link
The don't choke and die, but they also can't deliver 4K at max detail and that is *in part* because of 4GB memory. http://www.hardocp.com/article/2015/05/04/grand_th...The 3.5GB 970 chokes early on 4K and needs feature reduction, the 980 allows more features, the Titan yet more features, in large part due to memory config.
Yeah it will be interesting how compression or new AA approaches lower memory usage but I will not buy a 4GB high end card now or in the future and depend on even more driver trickery to lower memory usage for demanding titles.
hansmuff - Tuesday, May 19, 2015 - link
To substantiate my comment about driver trickery, this is a quote from TechReport's HBM article:"When I asked Macri about this issue, he expressed confidence in AMD's ability to work around this capacity constraint. In fact, he said that current GPUs aren't terribly efficient with their memory capacity simply because GDDR5's architecture required ever-larger memory capacities in order to extract more bandwidth. As a result, AMD "never bothered to put a single engineer on using frame buffer memory better," because memory capacities kept growing. Essentially, that capacity was free, while engineers were not. Macri classified the utilization of memory capacity in current Radeon operation as "exceedingly poor" and said the "amount of data that gets touched sitting in there is embarrassing."
Strong words, indeed.
With HBM, he said, "we threw a couple of engineers at that problem," which will be addressed solely via the operating system and Radeon driver software. "We're not asking anybody to change their games.""
------------------------------
I don't trust them to deliver that on time and consistently.
chizow - Tuesday, May 19, 2015 - link
lol yeah, hopefully they didn't just throw the same couple of engineers who threw together the original FreeSync demos together on that laptop, or the ones who are tasked with fixing the FreeSync ghosting/overdrive issues, or the FreeSync CrossFire issues, or the Project Cars/TW3 driver updates. You get the point hehe, those couple engineers are probably pretty busy, I am sure they are thrilled to have one more promise added to their plates. :)dew111 - Tuesday, May 19, 2015 - link
Making something that is inefficient more efficient isn't "trickery," it's good engineering. And when the product comes out, we will be able to test it, so your trust is not required.
WinterCharm - Tuesday, May 19, 2015 - link
Exactly. And science and common sense have shown again and again that if you eliminate the bottlenecks, you can get significant performance gains. That's why SSDs are so great.
AndrewJacksonZA - Thursday, May 28, 2015 - link
What @dew111 said.
i7 - Tuesday, May 19, 2015 - link
Wouldn't you see higher memory configs much like the 970 memory config 'fiasco' with greater than 4GB on another substrate or another entirely different configuration?
dew111 - Tuesday, May 19, 2015 - link
No. The current HBM stacks come in a fixed capacity, and the Fiji chip will only have so many lanes. Also, it is unlikely an OEM would venture into designing (and funding) their own interposer; this probably won't happen for at least a few years (if ever).
akamateau - Monday, June 8, 2015 - link
Actually an OEM cannot design an interposer with a memory controller. AMD owns that patent.
Interposer having embedded memory controller circuitry
US 20140089609 A1
" For high-performance computing systems, it is desirable for the processor and memory modules to be located within close proximity for faster communication (high bandwidth). Packaging chips in closer proximity not only improves performance, but can also reduce the energy expended when communicating between the processor and memory. It would be desirable to utilize the large amount of "empty" silicon that is available in an interposer. "
AMD has pretty much sewn up the concept of an interposer being just a substrate with vias to stack and connect silicon.
Besides, it would also be unlikely for an OEM to be able to purchase unpackaged CPU or memory silicon for their own stacks. And why would they? Their manufacturing costs would be far higher.
eachus - Friday, May 22, 2015 - link
Don't forget the HBM1 vs. HBM2 change/upgrade that is coming. Will HBM2 show up late this year? Or early next year? Your guess. AMD will then be able to ship cards with twice the bandwidth--and four times the memory. My guess is that AMD plans a "mid-life kicker" for Fiji later this year taking it to 8 GBytes but still at HBM1 clock speeds. Then Greenland comes along with 16 Gig and HBM2 speeds.
BTW, don't knock the color compression technology. It makes (slightly) more work for the GPU, but reduces memory and bandwidth requirements. When working at 4K resolutions and beyond, it becomes very significant.
chizow - Tuesday, May 19, 2015 - link
GTA5 does go over 4GB at 1440p, as do a number of other next-gen games like Assassin's Creed Unity, Shadows of Mordor, and Ryse; I am sure Witcher 3 does as well. 6GB is probably safe for this gen until 14/16nm FinFET, 8GB safest, 12GB if you want no doubts. We also don't know what DX12 is going to do to VRAM requirements.
It's not about fitting the actual frame buffer, it's about holding and storing textures locally in VRAM so that the GPU has access to them without going to system RAM or, worse, local storage. Hi-res 4K and 8K textures are becoming more common, which increases the storage footprint 4-fold and 16-fold over 2K, so more VRAM is always going to be welcome.
silverblue - Tuesday, May 19, 2015 - link
That compression had better be good, then.
testbug00 - Tuesday, May 19, 2015 - link
According to Nvidia, without GameWorks, a 980 is the recommended card for 1440p at max settings, and with GameWorks, Titan X/SLI 970. At 2160p without GameWorks they recommend Titan X/SLI 980. Even at 2160p with GameWorks they still recommend 980 SLI.
Based on that, my WAG is that TW3 uses under 4GB of VRAM at 2160p. I'm guessing bringing GameWorks in pushes it just near the 4GB limit on the 980. Probably in the 3.9GB range.
chizow - Tuesday, May 19, 2015 - link
Can't say for sure as I don't have TW3 yet, but based on screenshots I wouldn't be surprised at all to see it break 4GB. In any case, games and drivers will obviously do what they can to work around any VRAM limitations, but as we have seen, it is not an ideal situation. I had a 980 and 290X long enough to know there were plenty of games dancing close enough to that 4GB ceiling at 1440p to make it too close for comfort.
Horza - Tuesday, May 19, 2015 - link
TW3 doesn't even get close; the highest VRAM usage I've seen is ~2.3GB @1440p, everything ultra, AA on, etc. In fact, of all the games you mentioned, Shadows of Mordor is the only one that really pushes past 4GB @1440p in my experience (without unplayable levels of MSAA). Whether that makes much difference to playability is another thing entirely; I've played Shadows on a 4GB card @1440p and it wasn't a stuttery mess or anything. It's hard to know without framerate/frametime testing if a specific game is using VRAM because it can or because it really requires it.
We've been through a period of rapid VRAM requirement expansion, but I think things are going to plateau soon like they did with the ports from the previous console generation.
chizow - Wednesday, May 20, 2015 - link
I just got TW3 free with Nvidia's Titan X promotion and it doesn't seem to be pushing upwards of 3GB, but the rest of the games absolutely do. Are you enabling AA? GTA5, Mordor, Ryse (with AA/SSAA), and Unity all push over 4GB at 1440p. Also, any game that has heavy texture modding, like Skyrim, appreciates the extra VRAM.
Honestly I don't think we have hit the ceiling yet; the consoles are the best indication of this as they have 8GB of RAM, which is generally allocated as 2GB/6GB for CPU/GPU, so you are looking at ~6GB to really be safe, and we still haven't seen what DX12 will offer. Given many games are going to single large resources like megatextures, being able to load the entire texture to local VRAM would obviously be better than having to stream it in using advanced methods like bindless textures.
przemo_li - Thursday, May 21, 2015 - link
False. It would be better to ONLY stream what will be needed!
And that is why DX12/Vulkan will allow just that. Apps will tell DX which part to stream.
Wholesale streaming will only be good if the whole resource will be consumed.
This is the benefit of bindless: only transfer what you will use.
chizow - Thursday, May 21, 2015 - link
False, streaming from system RAM or slower resources is non-optimal compared to keeping it in the local VRAM cache. Simply put, if you can eliminate streaming, you're going to get a better experience and more timely data accesses, plain and simple.
testbug00 - Tuesday, May 19, 2015 - link
A quick check on HardOCP shows max settings on a Titan X with just under 4GB of RAM used at 1440p. To keep it playable, you had to turn down the settings slightly.
http://www.hardocp.com/article/2015/05/04/grand_th...
You certainly can push VRAM usage over 4GB at 1440/1600p, but, generally speaking, it appears that it would push the game into not being fluid.
Having at least 6GB is 100% the safe spot. 4GB is pushing it.
chizow - Tuesday, May 19, 2015 - link
Those aren't max settings, not even close. FXAA is being used; turn it up to just 2xMSAA or MFAA with Nvidia and that breaks 4GB easily.
Source: I own a Titan X and play GTA5 at 1440p.
Also, the longer you play, the more you load, the bigger your RAM and VRAM footprint. And this is a game that launched on last-gen consoles in 2013, so to think 4GB is going to hold up for the life of this card with DX12 on the horizon is not a safe bet, imo.
Mark_gb - Sunday, May 24, 2015 - link
Do not forget the color compression that AMD designed into their chips. It's in Fiji. In addition, AMD assigned some engineers to work on ways to use the 4GB of memory more efficiently. In the past AMD viewed memory as free since capacities kept expanding, so they had never bothered to assign anyone to make memory usage efficient. Now, with a team having worked on that issue (which will be handled just through driver changes to make memory usage and allocation more efficient), 4GB will be enough.
xthetenth - Tuesday, May 19, 2015 - link
The Tonga XT x2 with HBM rumor is insane if you're suggesting the one I think you are. First off the chip has a GDDR memory controller, and second, if the CF profile doesn't work out, a 290X is a better card.
chizow - Tuesday, May 19, 2015 - link
I do think it's crazy, but the more I read the more credibility there is to that rumor lol. Btw, memory controllers can often support more than one standard, not uncommon at all. In fact, most of AMD's APUs can support HBM per their own whitepapers, and I do believe there was a similar leak last year that was the basis of the rumors Tonga would launch with HBM.
tuxRoller - Wednesday, May 20, 2015 - link
David Kanter really seemed certain that AMD was going to bring 8GB of HBM.
chizow - Wednesday, May 20, 2015 - link
Wouldn't be the first time David Kanter was wrong, certainly won't be the last. Still waiting for him to recant his nonsense article about PhysX lacking SSE and only supporting x87. But I guess that's why he's David Kanter and not David ReKanter.
Poisoner - Friday, June 12, 2015 - link
You're just making up stuff. No way Fiji is just two Tonga chips stuck together. My guess is your identity is wrapped up in nVidia so you need to spread FUD.
close - Tuesday, May 19, 2015 - link
That will be motivation enough to really improve on the chip for the next generation(s), not just rebrand it. Because to be honest very, very few people need 6 or 8GB on a consumer card today. It's so prohibitively expensive that you'd just have an experiment like the $3000 (now just $1600) 12GB Titan Z.
The fact that a select few can or would buy such a graphics card doesn't justify the costs that go into building such a chip, costs that would trickle down into the mainstream. No point in asking 99% of potential buyers to pay more to cover the development of features they'd never use. Like a wider bus, a denser interposer, or whatever else is involved in doubling the possible amount of memory.
chizow - Tuesday, May 19, 2015 - link
Idk, I do think 6 and 8GB will be the sweet spot for any "high-end" card. 4GB will certainly be good for 1080p, but if you want to run 1440p or higher and have the GPU grunt to push it, that will feel restrictive, imo.
As for the expense, I agree it's a little bit crazy how much RAM they are packing on these parts. 4GB on the 970 I thought was pretty crazy at $330 when it launched, but now AMD is forced to sell their custom 8GB 290X for only around $350-360, and there are more recent rumors that Hawaii is going to be rebranded again for the R9 300 desktop line with a standard 8GB. How much are they going to ask for it is the question, because that's a lot of RAM to put on a card that sells for maybe $400 tops.
silverblue - Tuesday, May 19, 2015 - link
...plus the extra 30-ish watts of power just for having that extra 4GB. I can see why higher capacity cards had slightly nerfed clock speeds.
przemo_li - Thursday, May 21, 2015 - link
VR. It requires 90Hz at 1080p x2 if one assumes graphics the same as current-gen non-VR graphics!
That is lots of data to push to and from the GPU.
robinspi - Tuesday, May 19, 2015 - link
Wrong. They will be using a dual-link interposer, making it 8GB instead of 4GB. Read more on WCCFTech:
http://wccftech.com/amd-radeon-r9-390x-fiji-xt-8-h...
HighTech4US - Tuesday, May 19, 2015 - link
Wrong. 4GB first, 8GB to follow (on a dual-GPU card).
http://www.fudzilla.com/news/graphics/37790-amd-fi...
chizow - Tuesday, May 19, 2015 - link
Wow lol. That 4GB rumor again. And that X2 rumor again. And an $849 price tag for just the single-GPU version???! I guess AMD is looking to be rewarded for their efforts with HBM and hitting that ultra-premium tier? I wonder if the market will respond at that asking price if the single-GPU card does only have 4GB.
przemo_li - Thursday, May 21, 2015 - link
An artificial number like X GB won't matter. An artificial number like YZW FPS in games S, X, E will ;)
Do note that Nvidia needs to pack in lots of GB just as a side effect of its wide bus!
It works for them, but games do not require 12GB now, nor in the short-term future (the consoles don't!)
chizow - Tuesday, May 19, 2015 - link
I guess we will find out soon enough!
chizow - Tuesday, May 19, 2015 - link
@robinspi: Looks like Ryan Shrout at PCPer all but confirms 1xGPU Fiji will be limited to 4GB this round; Joe Macri at AMD was discussing it with him and all but confirms it:
http://www.pcper.com/reviews/General-Tech/High-Ban...
"Will gaming suffer on the high end with only 4GB? Macri doesn’t believe so; mainly because of a renewed interest in optimizing frame buffer utilization. Macri admitted that in the past very little effort was put into measuring and improving the utilization of the graphics memory system, calling it “exceedingly poor.” The solution was to just add more memory – it was easy to do and relatively cheap. With HBM that isn’t the case as there is a ceiling of what can be offered this generation. Macri told us that with just a couple of engineers it was easy to find ways to improve utilization and he believes that modern resolutions and gaming engines will not suffer at all from a 4GB graphics memory limit. It will require some finesse from the marketing folks at AMD though…"
Looks like certain folks who trashed the 980 at launch for having only 4GB are going to have a tough time respinning their stories to fit an $850 AMD part with only 4GB.....
Crunchy005 - Tuesday, May 19, 2015 - link
How are you so sure it will be $850? Stop making stuff up before it comes out.
chizow - Tuesday, May 19, 2015 - link
How are you so sure it won't be $850? Stop getting all butthurt and maybe read the typical rumor sites that have gotten everything else to date correct? 4GB HBM check. X2 check. Water cooled check. And today, multiple sources from these sites saying $850 and a new premium AMD GPU tier to try and compete with Titan.
testbug00 - Tuesday, May 19, 2015 - link
That price doesn't make sense given the cost differences between GDDR5 and HBM, once you take into account some cost savings that offset a portion of the added HBM cost.
I'm guessing if they found a way to make an 8GB version, it would be 800-900 dollars, as that would eliminate the cost benefits of moving away from GDDR5 as far as I can tell.
A 4GB version I would expect to be 500-550 and 650-700 respectively. Well, to be honest, I personally think they will have 3 different core counts coming from Fiji, given the large gap in CUs from Hawaii to Fiji (given that it has 64 CUs, which everything still points towards).
chizow - Wednesday, May 20, 2015 - link
Huh? Do you think HBM costs more than GDDR5 to implement, or not? There are minor savings on cheaper components/processes, like the PCB, but HBM could be 3-10x more expensive per GB; given historical new-RAM pricing, none of this is that far out there. We also know there's added complexity and cost with the interposer, and AMD is not putting expensive HBM on lower end parts, rebadges, or APUs. This all points to the fact the BoM is high and they are looking to be rewarded for their R&D.
In any case, keep hoping for an 8GB (single-GPU) version; it seems pretty obvious the 4GB limits for HBM1 are true, as AMD is now in full damage control mode saying 4GB is enough.
medi03 - Tuesday, May 19, 2015 - link
Wow another AMD article and again nVidia trolls all over the place.
chizow - Tuesday, May 19, 2015 - link
Well, it is always fun to watch AMD overpromise and underdeliver. Oops, was that a troll? :)
Horza - Tuesday, May 19, 2015 - link
You know you are, chiz, that's why you responded to his comment in the first place!
chizow - Wednesday, May 20, 2015 - link
Heheh nah, always fun jabbing AMD fanboys like medi03 that I've gone back and forth with over the years. He's been really quiet lately; he may actually be disheartened by AMD's recent bad fortunes, which is uncommon for these die hard AMD fans!
ravyne - Tuesday, May 19, 2015 - link
No, not necessarily. AMD isn't exactly allaying any fears by remaining silent so far, but there's a method for chaining two HBM chips together, similar to how chip-select works in normal DDR RAM or SRAMs in embedded systems -- basically you have two chips sharing that 1024-bit memory bus, but there's a separate control signal that indicates which chip the host is talking to. In theory you can chain things along forever with enough chip selects, but a high-performance and highly-parallel bus like HBM is practically limited by signal-propagation latency and misalignment, so using just two chips per HBM bus is more of a practical limitation.
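As a toy illustration of the chip-select idea described above (the function name and the flat-address scheme are invented for the example; only the 1 GB-per-stack figure comes from the thread):

```python
# Toy model: two 1 GB HBM stacks share one wide data bus; a separate select
# signal, derived here from the top address bit, decides which stack responds.
STACK_SIZE = 1 << 30  # 1 GB per HBM1 stack

def route_access(addr: int) -> tuple[int, int]:
    """Map a flat 2 GB address to (chip_select, offset_within_stack)."""
    if not 0 <= addr < 2 * STACK_SIZE:
        raise ValueError("address outside the 2 GB window")
    return addr // STACK_SIZE, addr % STACK_SIZE

print(route_access(0x1000))           # (0, 4096): first stack drives the bus
print(route_access(STACK_SIZE + 64))  # (1, 64): second stack drives the bus
```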
looncraz - Tuesday, May 19, 2015 - link
Nope, at least not according to my understanding. In fact, in theory, HBM1 can be configured, at reduced speeds, at well over 8GB. The article even mentions a technical bit of information pertaining to this:
"HBM in turn allows from 2 to 8 stacks to be used, with each stack carrying 1GB of DRAM."
From 2GB to 8GB right there, without any trickery. It appears HBM chips need to be used in pairs (otherwise a 2-chip minimum makes no sense), and likely need to be addressed in pairs (with a 512-bit bus per chip, it would seem). This would indicate there is a two-bit address line which allows from one to four pairs to be individually addressed, or perhaps four binary address lines, whichever they deemed to be more economical and prudent. Either way it appears each stack has a 512-bit data bus.
If correct, you can even use a single 1024-bit bus and interleave on the bus and address 8GB @ 128GB/s maximum. A 2048-bit bus would limit at 16GB @ 256 GB/s, a 3072-bit bus could use 24GB @ 384GB/s, and a 4096-bit bus could use 32GB @ 512GB/s. Interleaving on the bus, though, would increase latency and decrease throughput.
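The GB/s figures above fall out of a simple formula: peak bandwidth = bus width x per-pin data rate. A quick sketch, assuming HBM1's 1 Gb/s per pin; the capacity pairings are the commenter's speculation, not the spec.

```python
# Peak bandwidth for the bus widths quoted above, at HBM1's 1 Gb/s per pin.
PIN_RATE_GBPS = 1.0  # HBM1: 500 MHz DDR = 1 Gb/s per pin

def peak_bandwidth_gb_s(bus_width_bits: int) -> float:
    return bus_width_bits * PIN_RATE_GBPS / 8.0  # bits/s -> bytes/s

for width in (1024, 2048, 3072, 4096):
    print(f"{width}-bit bus: {peak_bandwidth_gb_s(width):.0f} GB/s")
# 1024 -> 128 GB/s, 2048 -> 256 GB/s, 3072 -> 384 GB/s, 4096 -> 512 GB/s (Fiji's rumored setup)
```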
That said, no company, especially not AMD, would design and then bet big on a memory technology that limited them to 4GB without having a solution ready. Everything I mentioned that the HBM chips would be required to support are standard for memory chips made for the last many many decades and was probably included even in the first rough draft for the command protocol without anyone even thinking about it twice. That's just how it works.
It might even be possible to use an 512-bit bus and some latching circuitry to drive HBM. You might even be able to do this with good performance and high capacities without modifying the memory chips at all.
chizow - Wednesday, May 20, 2015 - link
All sounds really good in theory, unfortunately none of the (substantial) source material from AMD/Hynix supports this, nor do the comments from the AMD VP Macri who seems more or less resigned to the fact AMD is going forward with 4GB for HBM1.
But in any case, hopefully you won't be too disappointed if it is only 4GB.
looncraz - Wednesday, May 20, 2015 - link
Your comment made me remember that the standard was submitted to JEDEC.
JESD235 pertains to HBM (v1); from it I was able to determine that if 8GB was to be supported using 1GB stacks, the command interface would have to be duplicated per chip, but the (much larger) data bus could be shared - with some important timing caveats, of course, but that is nothing new for memory controllers (in fact, that is most of what they do). It is not necessarily something you'd want to do without having already had a working product using the memory technology... and certainly not something you'd bother implementing if you expected higher capacity chips to be available in a year's time...
I finally see how HBM works internally (something that's been lacking from most "technical" articles), and I see why its external interface doesn't follow convention - it's basically an 8/16 bank "up to 8 channel" collection of DRAM chips. Each channel can be addressed separately with a 128-bit data bus and can support 32Gb (4GB) of DRAM.
So HBM uses the relevant addressing lines internally, if at all (vendor specific), and doesn't provide for such a mechanism externally.
From what I'm seeing, it would seem you can build HBM with any width you want, in intervals of 128-bits. Of course, standards are designed to be flexible. That could mean lower powered devices could use 256bit HBM interfaces to save power... unless I'm totally missing something (which is quite likely, it isn't like reading a standards document is the same as reading a quick overview ;-)).
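Using the figures from the comment above (8 independent 128-bit channels per stack, up to 32 Gb per channel as the commenter reads the spec), the per-stack totals work out as follows; treat the capacity ceiling as that reading of the document rather than a confirmed product limit.

```python
# Per-stack totals from the channel figures quoted above.
CHANNELS_PER_STACK = 8
CHANNEL_WIDTH_BITS = 128
MAX_CHANNEL_GBIT = 32      # per the commenter's reading of JESD235
PIN_RATE_GBPS = 1.0        # HBM1 per-pin data rate

width_bits = CHANNELS_PER_STACK * CHANNEL_WIDTH_BITS       # 1024-bit interface
bandwidth = width_bits * PIN_RATE_GBPS / 8.0               # 128 GB/s per stack
capacity_gb = CHANNELS_PER_STACK * MAX_CHANNEL_GBIT / 8    # 32 GB addressable
print(f"{width_bits}-bit stack interface, {bandwidth:.0f} GB/s, up to {capacity_gb:.0f} GB addressable")
```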
chizow - Thursday, May 21, 2015 - link
Yep exactly, that's where the 4GB limit for HBM1 originally came from: the JEDEC/Hynix source documents.
akamateau - Thursday, May 28, 2015 - link
With high-bandwidth memory, depth is not necessary. Of course, only the benchmarks will actually show us.
And of course DX11 will be useless for this product. HBM was designed to solve a problem! DX12 solves the CPU bottleneck; however, DX12 benchmarks show that performance scales up nicely to 20 million+ draw calls per second with 6 CPU cores feeding the GPU. When the CPU has 8 cores the performance flatlines and does not get any better.
AnandTech demonstrated this quite clearly a few weeks back. However, HBM will scale far beyond 6 cores as there is more throughput.
Of course, that would mean that the 390X must be benched using DX12 benchmarks. But that is what they were designed for: Mantle and DX12.
akamateau - Thursday, May 28, 2015 - link
You do not need the memory depth with HBM.
HBM was designed to solve a problem that becomes apparent with DX12. DX11 does not support multithreaded and multicore gaming. DX12 enables ALL CPU cores to feed the GPU through Asynchronous Shader Pipelines and Asynchronous Compute Engines.
With DX12, GPU performance scales well to 6 CPU cores; beyond that the GPU draw call performance flatlines: GPU bottleneck. HBM will solve this problem.
DX11 is such a crippling API that anyone even using it to make a decision regarding a $1000 GPU purchase will likely waste their money.
With DX12, benching the Radeon 390X with HBM will demonstrate 400-500% performance increases over DX11.
Do you want to know the facts before you spend your money? Then demand DX12 benchmarks!!
akamateau - Thursday, May 28, 2015 - link
"According to AMD's Joe Macri, GDDR5 fed GPUs actually have too much unused memory today. Because to increase GPU memory bandwidth, wider memory interfaces are used. And because wider memory interfaces require a larger amount of GDDR5 memory chips, GPUs ended up with more memory capacity than is actually needed.
Macri also stated that AMD invested a lot into improving utilization of the frame buffer. This could include on-die memory compression techniques which are integrated into the GPU hardware itself. Or more clever algorithms on the driver level."
http://wccftech.com/amd-addresses-capacity-limitat...
DX11 will not likely allow an HBM AIB to show much of an improvement in performance. Run DX12 games or benchmarks and HBM will rock that AIB!
A5 - Tuesday, May 19, 2015 - link
Interesting. The article says that AMD is the only anticipated user of HBM1, but are there any rumors on where HBM2 might go?
The obvious thing is to make the stacks higher/denser (2-4GB per stack seems more suited to high-end 4K/VR gaming) and to increase the clocks on the interface.
chizow - Tuesday, May 19, 2015 - link
Nvidia has already confirmed HBM2 support with Pascal (see the ref PCB on the last page). I guess they weighed the pros/cons of low supply/high costs and limited VRAM on HBM1 and decided to wait until the tech matured. HBM1 also has significantly less bandwidth than what HBM2 claims (1+TB/s).
DanNeely - Tuesday, May 19, 2015 - link
Probably part of it; but I suspect passing on HBM1 is part of the same more conservative engineering approach that's led to nVidia launching on new processes a bit later than ATI has over the last few generations. Going for the next big thing early on potentially gives a performance advantage, but comes at a cost. Manufacturing is generally more expensive because early adopters end up having to fund more of the upfront expenses in building capacity, and being closer to the bleeding edge generally results in the engineering to make it work being harder. A dollar spent on fighting with bleeding-edge problems is either going to contribute to higher device costs, or to less engineering being able to optimize other parts of the design.
There's no right answer here. In some generations ATI got a decent boost from either a newer GDDR standard or GPU process. At other times, nVidia's gotten big wins from refining existing products; the 7xx/9xx series major performance/watt wins being the most recent example.
chizow - Wednesday, May 20, 2015 - link
Idk, I think AMD's early moves have been pretty negligible. GDDR4 for example was a complete flop, made no impact on the market; Nvidia skipped it entirely and AMD moved off of it even in the same generation with the 4770. GDDR5 was obviously more important, and AMD did have an advantage with their experience with the 4770. Nvidia obviously took longer to get their memory controller fixed, but since then they've been able to extract higher performance from it.
And that's not even getting into AMD's proclivity for going to a leading-edge process node sooner than Nvidia. Negligible performance benefit, certainly more efficiency (except when we are stuck on 28nm), but not much in the way of increased sales, profits, margins etc.
testbug00 - Tuesday, May 19, 2015 - link
They probably also didn't have the engineering set up for it. *rolls eyes* For all of Nvidia's software superiority in the majority of cases, it is commonly accepted that AMD has far better physical design.
And they also co-developed HBM. That probably doesn't hurt!
Nvidia probably wouldn't have gone with it anyways, but, I don't think they even had the option.
chizow - Tuesday, May 19, 2015 - link
No, the article covers it quite well: AMD tends to move to next-gen commodity processes as soon as possible in an attempt to generate competitive advantage, but unfortunately for them, this seldom pays off and typically increases their risk and exposure without any significant payoff. This is just another example, as HBM1 clearly has limitations and trade-offs related to capacity, cost and supply.
As for not having the option lol, yeah I am sure SK Hynix developed the process to pander it to only AMD and their measly $300M/quarter in GPU revenue.
testbug00 - Tuesday, May 19, 2015 - link
Next-gen process? What does that have to do with HBM again? There you lose me, even with that slight explanation.
Now, HBM has issues, but supply isn't one of them. Capacity (whether AMD really can make an 8GB card; a 6GB card would be enough, really) is the real issue. Cost is a lesser one; it can be partially offset, so the extra cost of HBM won't all be extra cost eaten by AMD/added to the card. However, the cost will be higher than if the card had 4GB of GDDR5.
AMD *worked with* SK Hynix to develop this technology. This technology is going to be widely adopted. At least, SK Hynix believed that enough to be willing to push forward with it while only having AMD as a partner (it appears to me). There's obviously some merit with it.
chizow - Tuesday, May 19, 2015 - link
HBM is that next-gen, commodity process....
How can you say HBM doesn't have supply/yield issues? You really can't say that; in fact, if it follows the rest of the DRAM industry's historical pricing, prices are going to be exponentially higher until they ramp for the mainstream.
This article already lists out a number of additional costs that HBM carries, including the interposer itself which adds complexity, cost and another point of failure to a fledgling process.
testbug00 - Tuesday, May 19, 2015 - link
Because HBM doesn't bring any areas where you get to reduce cost.
Currently, it does and will add a net cost. It also can reduce some costs. *yawn*
chizow - Thursday, May 21, 2015 - link
What? Again, do you think it will cost more, or not? lol.
Ranger101 - Wednesday, May 20, 2015 - link
Lol @ Chizowshill doing what he does best, Nvidia troll carrot still visibly protruding, stenching out the Anandtech forums...thanks for the smiles dude.
chizow - Wednesday, May 20, 2015 - link
lol @ trollranger101 doing what he does best, nothing.
at80eighty - Thursday, May 21, 2015 - link
check any thread - he is predictable.
chizow - Thursday, May 21, 2015 - link
Yes, predictably clearing up, refuting, and debunking the misinformation spread by...you guessed it, AMD fanboys like yourselves.
Intel999 - Tuesday, May 19, 2015 - link
AMD has six months of exclusivity on HBM1 since they co-created it with Hynix. That is why no one else is using it yet.
chizow - Wednesday, May 20, 2015 - link
Possibly; that would make sense, and it would also explain why they are still going forward with it even if the first iteration isn't exactly optimal due to the limitations covered here (4GB, increased costs, etc.).
SunLord - Tuesday, May 19, 2015 - link
HBM2 is supposed to double the bandwidth and density, so 8GB of RAM and 1TB/sec... for a 4-chip setup it also seems to allow up to 32GB, but HBM2 isn't supposed to be ready till Q2 2016.
Kevin G - Tuesday, May 19, 2015 - link
Which is fine, as the big 16/14 nm FinFET next-generation chips aren't due till around then anyway. The memory technology and foundry plans are aligning rather well.
testbug00 - Tuesday, May 19, 2015 - link
It appears HBM2 increases it from 1GB to up to 4GB per stack (2-4GB). Page 12:
http://www.hotchips.org/wp-content/uploads/hc_arch...
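A quick sanity check on the HBM2 numbers quoted above. The 2 Gb/s per-pin rate and the 8 GB-per-stack ceiling are the figures that later appeared in the JEDEC spec; treat them as assumptions here, since the thread itself only cites an early Hot Chips slide.

```python
# Where "1 TB/s and up to 32 GB for a 4-stack setup" comes from.
PIN_RATE_GBPS = 2.0        # HBM2 doubles HBM1's per-pin rate (assumed)
STACK_WIDTH_BITS = 1024
MAX_STACK_GB = 8           # 8-Hi stacks of 8 Gb dies (assumed ceiling)

def hbm2_totals(stacks: int) -> tuple[float, int]:
    bandwidth_gb_s = stacks * STACK_WIDTH_BITS * PIN_RATE_GBPS / 8.0
    capacity_gb = stacks * MAX_STACK_GB
    return bandwidth_gb_s, capacity_gb

bw, cap = hbm2_totals(4)
print(f"4 stacks: ~{bw:.0f} GB/s peak, up to {cap} GB")  # ~1024 GB/s, 32 GB
```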
hans_ober - Tuesday, May 19, 2015 - link
How does temperature affect TSVs and the silicon interposer? Continuous thermal cycling usually stresses out joints. Wouldn't want one of the many thousand joints to break.
Mr Perfect - Tuesday, May 19, 2015 - link
If I understand it correctly, joints usually suffer from thermal cycling because they are between different materials that heat and cool at different rates. The TSVs will be connecting silicon to silicon, so presumably the heating and cooling will be uniform and not stress the joints in that way.
chizow - Tuesday, May 19, 2015 - link
Nice article Ryan, I think this gets back to some of the general tech deep dives that a lot of people miss on AT, rather than the obligatory item reviews that I know you guys have to put out as well. Always interesting to read about new and upcoming technology, thanks for the read!
This part on the last page, however, I think needs to be clarified, as it is REALLY important to stay consistent in terminology now that GPU socket and PCB topology is changing:
"By AMD’s own estimate, a single HBM-equipped GPU package would be less than 70mm X 70mm (4900mm2), versus 110mm X 90mm (9900mm2) for R9 290X."
Even by AMD's own slide, that is *PCB area* occupied by either the HBM GPU package, or the GPU + GDDR5 modules. Calling everything a "package" doesn't really fit here and just confuses the issue if we keep the term Package intact, meaning GPU substrate sitting on PCB.
extide - Tuesday, May 19, 2015 - link
No, package is the correct term, as it is a single complete item that attaches to the PCB, including the GPU, RAM, interposer, etc., all in one piece. It is not much different than the MCMs (Multi-Chip Modules) that many manufacturers (Intel for example) have used in the past. Since the memory is all on the package, the PCB area used is the same size as the package itself, in this case.
chizow - Tuesday, May 19, 2015 - link
I agree the HBM package terminology is correct, but I'm not referring to that. I'm referring to the statement that the package size is 110mm x 90mm for the R9 290X. That's not very clear, because they are counting the *PCB AREA* on the 290X and using it synonymously with package.
It would be more clearly stated if it read something like:
"By AMD’s own estimate, the PCB area occupied by a single HBM-equipped GPU package would be less than 70mm X 70mm (4900mm2), versus 110mm X 90mm (9900mm2) PCB area for R9 290X that includes the GPU package and GDDR5 modules."
gamerk2 - Tuesday, May 19, 2015 - link
How much does the production of the interposer cost? It's obviously going to eat into AMD's margins, which would imply that unless they sell more product, their profits will actually decline. Likewise, I wonder if that extra cost is going to squeeze them on the low end, where they currently have an advantage.
extide - Tuesday, May 19, 2015 - link
Probably on the order of $10/GPU -- not a ton, but enough to be a significant item on the BoM.
DanNeely - Tuesday, May 19, 2015 - link
I doubt gen 1 HBM will show up on budget cards, and wouldn't hold my breath on gen 2 or 3 either. For the 4xx generation, they're only putting it on the 490 family; 460-480 are going to remain on GDDR5. HBM will presumably kill off GDDR5 for midlevel cards over the next few years, but unless it becomes as cheap as DDR4 it's not going to be a factor on budget GPUs.
SunLord - Tuesday, May 19, 2015 - link
I wonder how this will impact dual GPU cards.
Kevin G - Tuesday, May 19, 2015 - link
It'd actually make the board design simpler, as the difficult part, the memory traces, are now all in the interposer. The challenge for dual GPU designs shifts toward power and cooling.
With the board area savings, they could conceptually do a triple GPU card. The problem wouldn't be the design of such a card but actually getting enough power. Of course they could go outside the PCIe spec and go towards a 525W design for such a triple GPU beast.
testbug00 - Tuesday, May 19, 2015 - link
Naw, they would go for a full water cooled 725W card XD (1W under the electrical limits of PCIe + 8 + 8 pin iirc) trollololoolololol.
Don't think anyone would do a triple GPU card for consumers. The scaling is still pretty bad beyond 2 iirc.
Kevin G - Tuesday, May 19, 2015 - link
The wonderful thing about having all the GPUs on a single board is that incorporating a private high speed bus between chips becomes possible to improve scaling. AMD attempted this before with the 4870X2: http://www.anandtech.com/show/2584/3
However, it was never really utilized as it was disabled via drivers.
Alternatively, multiple GPU dies and memory could just be placed onto the same the interposer. Having a fast and wide bus between GPU dies would then become trivial. Power consumption and more importantly power density, would not be so trivial.
Flunk - Tuesday, May 19, 2015 - link
This opens up a lot of possibilities. AMD could produce a CPU with a huge amount of on-package cache like Intel's Crystalwell, but higher density.
For now it reinforces my opinion that the 10nm-class GPUs that are coming down the pipe in the next 12-16 months are the ones that will really blow away the current generation. The 390 might match the Titan in gaming performance (when not memory-constrained) but it's not going to blow everything away. It will be comparable to what the 290X did to the 7970: just knock it down a peg instead of demolishing it.
Kevin G - Tuesday, May 19, 2015 - link
Indeed, and that idea hasn't been lost on other companies. Intel will be using a similar technology with the next Xeon Phi chip.
See: http://www.eetimes.com/document.asp?doc_id=1326121
Yojimbo - Tuesday, May 19, 2015 - link
CPUs demand low-latency memory access, while GPUs can hide the latency and require high bandwidth. Although I haven't seen anything specifically saying it, it seems to me that HMC is probably lower latency than HBM, and HBM may not be suitable for that system.
Kevin G - Tuesday, May 19, 2015 - link
I'll agree that CPUs need a low-latency path to memory.
Where I'll differ is on HMC. That technology does some serial-to-parallel conversion which adds latency into the design. I'd actually fathom that HBM would be the one with the lower latency.
nunya112 - Tuesday, May 19, 2015 - link
I don't see this doing very well at all for AMD. Yields are said to be low, and this is 1st gen, so it's bound to have issues. Compound that with the fact we haven't seen an OFFICIAL driver in 10 months, and poor-performing games at release. I want to back the little guy, but I just can't; there are far too many risks.
Plus the 980 will come down in price when this comes out, making it a great deal. Coupled with 1-3 months for new drivers from NV, you have yourself a much better platform. And as others mentioned, HBM2 will be where it is at, as 4GB on a 4K-capable card is pretty bad. So there's no point buying this card if 4GB is the max; it's a waste. 1440p fills 4GB. So it looks like AMD will still be selling 290X rebrands, doing poorly on the flagship product, and Nvidia will continue to dominate discrete graphics.
The 6GB 980 Ti is looking like a sweet option till Pascal.
And for me, my R9 280 just doesn't have the horsepower for my new 32" Samsung 1440p monitor, so I have to get something.
jabber - Tuesday, May 19, 2015 - link
Maybe you should have waited on buying the monitor?
WithoutWeakness - Tuesday, May 19, 2015 - link
No, this is clearly AMD's fault.
xthetenth - Tuesday, May 19, 2015 - link
I'd expect that the reason for the long wait in drivers is getting the new generation's drivers ready. Also, what settings does 1440p fill 4 GB on? I don't see 980 SLI or the 295X tanking in performance, as they would if their memory was getting maxed.
Xenx - Tuesday, May 19, 2015 - link
DriverVer=11/20/2014, 14.501.1003.0000, with the Catalyst package being 12/8/2014 - that would be 6 months. You're more than welcome to feel 6 months is too long, but it's not 10.
Kevin G - Tuesday, May 19, 2015 - link
Typo on page 2: "Tahiti was pushing things with its 512-bit GDDR5 memory bus." That should be Hawaii with a 512-bit wide bus or Tahiti with a 384-bit wide bus.
jjj - Tuesday, May 19, 2015 - link
jjj - Tuesday, May 19, 2015 - link
"First part of the solution to that in turn was to develop something capable of greater density routing, and that something was the silicon interposer. ""Moving on, the other major technological breakthrough here is the creation of through-silicon vias (TSVs). "
You guys are acting like interposers and TSV were created by AMD and Hynix for this, it's hugely misleading the way you chose to phrase things.
And ofc, as always when you do this kind of article (Aptina, Synaptics, Logitech and a few more in the last few years), it's more advertising than anything else.You don't talk about other similar technologies ,existing or potential, you just glorify the one you are presenting.
Crunchy005 - Tuesday, May 19, 2015 - link
This isn't an article on HBM itself but on AMD's next gen cards. They are focusing on AMD because of that fact. If this were about HBM itself I'm sure they would talk about other technologies out there as well. Don't criticize because they are staying on topic in the article.
testbug00 - Tuesday, May 19, 2015 - link
A side note for the article: ATI was also the main developer of GDDR3, with JEDEC helping a little. Nvidia launched with it first, but ATI __DID__ most of the design work.
testbug00 - Tuesday, May 19, 2015 - link
Having finished the article, I was also under the impression that high-clock GDDR5 used 2-2.5 watts per chip on the board. I don't see how 7GBps GDDR5 with 50% more chips would use only 5% more power (currently on the graph: 290 == 16 chips @5GBps, ~30W; Titan X = 24 chips @7GBps, ~31.5W).
Given AMD's ~15-20% for the 290X, I would put that at around 35-50W, while NVidia's solution is at least 50W. Of course, I could be wrong!
testbug00 - Tuesday, May 19, 2015 - link
As a note, I get that you used the GDDR5 bandwidth/W you can get... However, that's likely at the best point in the perf/watt curve. I suspect that's under 5GBps, based on AMD's claimed GDDR5 consumption on the 290(X) and their memory clock.
Which would put AMD's under that number, and NVidia's further under that number.
testbug00 - Tuesday, May 19, 2015 - link
Oh, and here's the slide you have that "proves" it: http://images.anandtech.com/doci/9266/HBM_9_Compar...
That means at 7GBps, at max bandwidth/watt, the Titan X should be using ~63 watts of power: (28/10.66) * 24 = 63.04
Ryan Smith - Tuesday, May 19, 2015 - link
They're rough estimates based on power consumption per bit of bandwidth and should be taken as such. Titan X has more chips, but it doesn't have to drive a wider memory bus.
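A sketch of the estimating method Ryan describes: the interface power tracks the bandwidth the bus actually delivers, not the raw chip count. The 10.66 GB/s-per-watt GDDR5 figure is the one quoted from AMD's slide elsewhere in this thread; the rest is straightforward arithmetic.

```python
# Estimate GDDR5 interface power from total bus width and data rate.
GDDR5_GB_S_PER_WATT = 10.66  # efficiency figure quoted from AMD's HBM slide

def gddr5_power_w(bus_width_bits: int, data_rate_gbps: float) -> float:
    bandwidth_gb_s = bus_width_bits * data_rate_gbps / 8.0
    return bandwidth_gb_s / GDDR5_GB_S_PER_WATT

print(f"R9 290X, 512-bit @ 5 Gbps: ~{gddr5_power_w(512, 5.0):.1f} W")   # ~30 W
print(f"Titan X, 384-bit @ 7 Gbps: ~{gddr5_power_w(384, 7.0):.1f} W")   # ~31.5 W
```

Those are the ~30W and ~31.5W figures from the article's chart; multiplying a per-chip number by 24 chips double-counts, since with 24 chips on a 384-bit bus two chips share each 32-bit channel.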
HighTech4US - Tuesday, May 19, 2015 - link
Facts have never gotten in the way of testbug's anti-Nvidia drivel.
testbug00 - Tuesday, May 19, 2015 - link
So, should I assume that GDDR5 chips don't use power if you don't make a wider bus? And that 7GBps is the best bandwidth/watt of GDDR5? Or that GDDR5 power consumption doesn't change when you raise or lower the clockspeed?
Nvidia's generalized power is just easier to calculate because they use 7GBps. Anyhow, my guesstimate for the 290X is that it uses 32W, given perfect power scaling from 5GBps to 7GBps and that it has fewer chips to run voltage to.
The reality is probably AMD's is 40-50W and NVidia is 50-60W. Running more GDDR5 chips at higher clockspeeds, even on a smaller bus, should result in higher power usage.
I have rose-tinted glasses, but I also do have a brain.
silverblue - Tuesday, May 19, 2015 - link
It's quite the role-reversal, really. Back in the GT 200 days, NVIDIA were throwing out cards with wider memory buses, and AMD showed them that it was (mostly) unnecessary.
Whichever solution uses the most power for GDDR5 stands to gain the most with a move to HBM. I'd be interested in seeing how much juice the memory on a 12GB card uses...
testbug00 - Tuesday, May 19, 2015 - link
testbug00 - Tuesday, May 19, 2015 - link
Nvidia didn't really have a choice; GDDR5 was *barely* ready for the 4870 iirc. Nvidia would have had to hold back finished cards for months to be able to get GDDR5 on them. Actually, they would have had to take a bet on whether GDDR5 would be ready for production at that point.
It isn't as simple as flipping a switch and having the GDDR5 controller work for GDDR3. It would require additional parts, leading to fewer dies per wafer and lower yield.
Nvidia did what was required to ensure their part would be able to get to market ASAP with enough memory bandwidth to drive its shaders.
silverblue - Wednesday, May 20, 2015 - link
Very true.
testbug00 - Tuesday, May 19, 2015 - link
Each chip of GDDR5 has a voltage, correct? So each additional chip consumes more power?
Maybe I'm missing something.
Ryan Smith - Tuesday, May 19, 2015 - link
An exceptional amount of energy is spent on the bus and host controller, which is why GDDR power consumption is such a growing issue. At any rate, yes, more chips will result in increased power, but we don't have a more accurate estimation at this time. The primary point is that the theoretical HBM configuration will draw half the power (or less) of the GDDR5 configurations.
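To put rough numbers on the "half the power (or less)" point, here is a minimal comparison using the efficiency figures from AMD's slides: ~10.66 GB/s per watt for GDDR5 and a claimed 35+ GB/s per watt for HBM. Both are vendor numbers, so treat the result as AMD's framing rather than a measurement.

```python
# Interface power needed to hit a 512 GB/s target with GDDR5 vs. HBM,
# using AMD's published GB/s-per-watt figures (vendor claims).
GDDR5_GB_S_PER_WATT = 10.66
HBM_GB_S_PER_WATT = 35.0
TARGET_GB_S = 512.0   # a 4-stack HBM1 configuration

gddr5_w = TARGET_GB_S / GDDR5_GB_S_PER_WATT
hbm_w = TARGET_GB_S / HBM_GB_S_PER_WATT
print(f"GDDR5: ~{gddr5_w:.0f} W, HBM: ~{hbm_w:.0f} W for {TARGET_GB_S:.0f} GB/s")
# ~48 W vs ~15 W -- comfortably under half, per the claimed figures
```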
Take 2048 shaders, 16GB of HBM, 4 CPU cores and a PCH, slap it onto a PCB, and ship it.
silverblue - Tuesday, May 19, 2015 - link
Don't forget the SSD. ;)
Ashinjuka - Tuesday, May 19, 2015 - link
Ladies and gentlemen, I give you... The MacBook Hair.
Crunchy005 - Tuesday, May 19, 2015 - link
Put it all under an IHS and a giant heatsink on top with one fan.
mr_tawan - Wednesday, May 20, 2015 - link
I was about to say the same thing :)
DanNeely - Tuesday, May 19, 2015 - link
Honestly, I'm most interested in seeing what this is going to do for card sizes. As the decreased footprints for a GPU+HBM stack in AMD's planning numbers or nVidia's Pascal prototype show, there's potential for significantly smaller cards in the future.
Water cooling enthusiasts look like big potential winners; a universal block would cover the RAM too instead of just the GPU, and full-coverage blocks could be significantly cheaper to manufacture.
I'm not so sure about the situation for air-cooled cards, though. Blower designs shouldn't be affected much, but no one really likes those. Open-air designs look like they're more at risk. If you shorten the card significantly you end up with only room for two fans on the heatsink instead of three, meaning you'd either have to accept reduced cooling or higher and louder fan speeds. That, or have the cooler significantly overhang the PCB, I suppose. Actually, that has me wondering how, or if, being able to blow air directly through the heatsink instead of in the top and out the sides would impact cooling.
jardows2 - Tuesday, May 19, 2015 - link
The open-air problem is only a problem if there is a new, smaller form factor for the video cards. OEM partners will likely make oversized heatsinks, or use a custom PCB to support more fans, just as they do now. With the reduced power envelope, I imagine the bulk of the OEMs will use the savings to make more compact designs, rather than use the energy savings to make higher-performing designs.
xenol - Tuesday, May 19, 2015 - link
I'm a bit skeptical that HBM will dramatically increase the performance of the GPU. While it's true that this will help with high-resolution rendering, there's also the fact that if the GPU wasn't up to snuff to begin with, it doesn't matter how much memory bandwidth you throw at it. But I'm willing to wait and see when this tech finally shows up on store shelves before committing to any idea. If anything, I'm only led to believe this will just solve memory bandwidth and power consumption issues for a while.
testbug00 - Tuesday, May 19, 2015 - link
It won't on 28nm. Give you higher core clocks within a TDP, yes. HBM currently shows bandwidth scaling to at least 8TB/s from what I can tell... which is over 20 times the Titan X currently. Even if they can "only" hit half of that, it should supply more than enough bandwidth until the 5nm process at least. So, at least 10 years, more likely 15-20.
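As a rough check on that "over 20 times" figure, here is a one-liner assuming the Titan X's commonly quoted ~336 GB/s of GDDR5 bandwidth (384-bit bus at 7Gbps); the 8TB/s ceiling is the commenter's number, not a spec.

```python
# Rough ratio check; 8 TB/s is the commenter's scaling figure, ~336 GB/s is
# the commonly quoted Titan X memory bandwidth (384-bit bus @ 7 Gbps/pin).
titan_x_bw = 384 / 8 * 7.0            # ~336 GB/s
hbm_scaling_ceiling = 8000.0          # 8 TB/s, as claimed above

print(hbm_scaling_ceiling / titan_x_bw)        # ~23.8x
print(hbm_scaling_ceiling / 2 / titan_x_bw)    # "only" half of it: ~11.9x
```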
chizow - Tuesday, May 19, 2015 - link
I agree, but it is interesting regardless, as 2.5D stacked RAM is clearly going to be the future of GPU memory, which will in turn drive different form factors, cooling solutions, etc.
der - Tuesday, May 19, 2015 - link
It'll bring a lot!
guskline - Tuesday, May 19, 2015 - link
Great article. Sounds like custom water coolers may be shut out, because the OEM cooler will be water cooled and there probably isn't enough improvement to be had from going custom.
I'm anxious to see the performance of a single 390X Fiji vs my 2 custom-cooled R9 290s in CF.
CaedenV - Tuesday, May 19, 2015 - link
This is fantastic. I mean, we cannot build any wider, so it is neat to see them finding ways to build upwards instead.
I would love to see a next gen device that pairs a card like this with HMC. Super fast mixed-use storage/memory combined with super fast GPU memory would make for a truly amazing combination.
Also, I don't see the 4GB limit being a big deal for mainstream high-end cards or laptops. It is only the ultra high-end enthusiast cards that might suffer in sales.
menting - Tuesday, May 19, 2015 - link
Just to be clear... HBM and HMC are not the same (but they are fairly similar in a lot of areas).
anubis44 - Tuesday, May 19, 2015 - link
nVidia also simply has fewer good, long-term relationships to exploit than AMD has. The whole semiconductor industry has been working with AMD for 45 years, whereas nVidia, run by Jen-Hsun Huang, a former AMD employee, has only been around for a little over 20 years.
HighTech4US - Tuesday, May 19, 2015 - link
What drugs are you on today?
testbug00 - Tuesday, May 19, 2015 - link
Name long-term, good relationships that Nvidia has had with other companies in the industry, besides their board partners. You could argue TSMC either way. Otherwise, I'm getting nothing. They recently started a relationship with IBM that could become long term. It is entirely possible I'm just missing the companies they partner with that are happy with their partnership in the semiconductor industry.
Compared to IBM, TSMC, SK Hynix, and more.
ImSpartacus - Tuesday, May 19, 2015 - link
Can we have an interview with Joe Macri? He seems like a smart fella if he was the primary reference for this article.
wnordyke - Tuesday, May 19, 2015 - link
This analysis does not discuss the benefits of the base die. The base die contains the memory controller and a data serializer. Moving the memory controller to the base die simplifies the design and removes many bottlenecks, and the base die is large enough to support a large number of circuits (#1 memory controller, #2 cache, #3 data processing). 4096 wires is a large number, and 4096 I/O buffers is a large number. The area of 4096 I/O buffers on the GPU die is expensive, and this expense is easily avoided by placing the memory controller on the base die. The 70% memory bus efficiency means the remainder is idle bandwidth, and that idle data does not need to be sent back to the GPU. The 4096 interposer signals reduce to (4096 * 0.7 ≈ 2867), saving 1,229 wires + I/O buffers.
A simple 2-to-1 serializer would reduce that down to (2867 * 0.5 ≈ 1434). The interposer wires are short enough to avoid termination resistors for a 2GHz signal. Removing the termination resistors is at the top of the list for saving power; second on the list is minimizing row activates.
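A minimal sketch of that wire-count arithmetic, using only the commenter's assumptions (4096 interposer signals, ~70% useful utilization, a hypothetical 2:1 serializer on the base die); none of this is confirmed Fiji design detail.

```python
# Wire-count arithmetic from the comment above; purely illustrative.
total_signals = 4096        # 4 stacks x 1024-bit HBM interface
bus_utilization = 0.70      # the ~70% bus-efficiency figure cited above
serializer_ratio = 2        # hypothetical 2:1 serializer on the base die

useful = round(total_signals * bus_utilization)       # ~2867 signals
after_serdes = round(useful / serializer_ratio)       # ~1434 signals

print(f"signals saved by dropping idle bandwidth: {total_signals - useful}")  # ~1229
print(f"signals remaining after 2:1 serialization: {after_serdes}")
```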
takeship - Tuesday, May 19, 2015 - link
So am I correct in assuming, then, that the 295X2-equivalent performance numbers for Fiji leaked months ago are for the dual-GPU variant? It concerns me that at no point in this write-up did AMD even speculate what the performance increase with HBM might be.
dew111 - Tuesday, May 19, 2015 - link
Why is everyone concerned about the 4GB limit in VRAM? A few enthusiasts might be disappointed, but for anyone who isn't using multiple 4K monitors, 4GB is just fine. It might also be limiting in some HPC workloads, but why would any of us consumers care about that?
chizow - Wednesday, May 20, 2015 - link
I guess the concern is that people were expecting AMD's next flagship to pick up where they left off on the high end, and given how much AMD has touted 4K, that would be a key consideration. Also, there are rumors this HBM part is $850 to create a new AMD super high-end, so yeah, if you're going to say 4K is off the table and still try to sell this as a super-premium 4K part, you're going to have a hard sell, as that's just a really incongruent message.
In any case, AMD says they can just driver-magic this away, which is a recurring theme for AMD, so we will see. HBM's main benefit is VRAM-to-GPU transfers, but anything that doesn't fit in the local VRAM is still going to need to come from system RAM or, worse, local storage. Textures for games are getting bigger than ever... so yeah, not a great situation to be stuck at 4GB for anything over 1080p, imo.
zodiacfml - Tuesday, May 19, 2015 - link
Definitely for their APUs and mobile. Doing this first on GPUs helps recover the R&D without needing the volume scale.
SolMiester - Tuesday, May 19, 2015 - link
Do the R9 290/X really perform that much better with overclocked memory on the cards? I didn't think AMD was ever really constrained by bandwidth, as they usually had more than enough on their generation of cards.
Consequently, I don't see the 390/X being that much competition for the Titan X.
Intel999 - Tuesday, May 19, 2015 - link
Thanks SolMiester, you have done an excellent job of displaying your level of intelligence. I don't think the New York Giants will provide much competition to the rest of the NFL this year. I won't support my prediction with any facts or theories; I just wanted to demonstrate that I am not a fan of the Giants.
BillyHerrington - Tuesday, May 19, 2015 - link
Since HBM is owned by AMD & Hynix, do other companies (NVIDIA, etc.) have to pay AMD in order to use HBM tech?
LukaP - Wednesday, May 20, 2015 - link
It's only developed by them. It's a technology that is on the market now (or will be in 6 months, after it stops being AMD-exclusive). It's the same with GDDR3/5: ATI did lots of the work developing it, but NV still had the option of using it.
chizow - Wednesday, May 20, 2015 - link
http://en.wikipedia.org/wiki/JEDEC
Like any standards board or working group, you have a few heavy lifters and everyone else leeches/contributes as they see fit, but all members have access to the technology in the hopes it drives adoption for the entire industry. Obviously the ones who do the most heavy lifting are going to be the most eager to implement it. See: FreeSync and now HBM.
Laststop311 - Wednesday, May 20, 2015 - link
I do not agree with this article saying GPUs are memory-bandwidth bottlenecked. If you don't believe me, test it yourself. Keep the GPU core clock at stock and maximize your memory OC, and see the very little if any gains. Now put the memory at stock and maximize your GPU core OC, and see the noticeable, decent gains.
HBM is still a very necessary step in the right direction. Being able to dedicate an extra 25-30 watts to the GPU core power budget is always a good thing. As 4K becomes the new standard and games upgrade their assets to take advantage of 4K, we should start to see GDDR5's bandwidth eclipsed, especially with multi-monitor 4K setups. It's better to be ahead of the curve than playing catch-up, but the benefits you get from using HBM right now, today, are actually pretty minor.
In some ways it hurts AMD, as it forces us to pay more money for a feature we won't get much use out of. Would you rather pay $850 for an HBM 390X, or $700 for a GDDR5 390X with basically identical performance, since memory bandwidth is still good enough for the next few years with GDDR5?
chizow - Wednesday, May 20, 2015 - link
I agree, bandwidth is not going to be the game-changer that many seem to think, at least not for gaming/graphics. For compute, bandwidth to the GPU is much more important, as applications are constantly reading/writing new data. For graphics, the main thing you are looking at is reducing valleys and any associated stutters or drops in framerate as new textures are accessed by the GPU.
akamateau - Monday, June 8, 2015 - link
High bandwidth is absolutely essential for the increased demand that DX12 is going to create. With DX11, GPUs did not work very hard. Massive draw call counts are going to require massive rendering, and that is where HBM is the only solution. With DX11 the API overhead limited a dGPU to around 2 million draw calls; with DX12 that changes radically to 15-20 million. All those extra polygons need rendering! How do you propose to do that with minuscule DDR4/GDDR5 pipes?
nofumble62 - Wednesday, May 20, 2015 - link
Won't be cheap. How many of you have pockets deep enough for this card?
junky77 - Wednesday, May 20, 2015 - link
Just a note - the HBM solution seems to be more effective for high memory bandwidth loads. For low loads, the slower memory with higher parallelism might not be effective against the faster GDDR5.
asmian - Wednesday, May 20, 2015 - link
I understand that the article is primarily focussed on AMD as the innovator and the GPU as the platform because of that. But once this is an open tech, and given the aggressive power budgeting now standard practice in motherboard/CPU/system design, won't there come a point at which the halving of power required means this *must* challenge standard CPU memory as well?
I just feel I'm missing a roadmap here (or even a single sidenote, really) about how this will play into the non-GPU memory market. If bandwidth and power are both so much better than standard memory, and assuming there isn't some other exotic game-changing technology in the wings (RRAM?), what is the timescale for switchover generally? Or is HBM's focus on bandwidth rather than pure speed the limiting factor for use with CPUs? But then, Intel forced us onto DDR4, which hasn't much improved speeds while increasing cost dramatically, because of the lower operating voltage and therefore power efficiency... so there's definitely form in transitioning to lower-power memory solutions. Or is GDDR that much more power-hungry than standard DDR, so the power saving won't materialise with CPU memory?
Ryan Smith - Friday, May 22, 2015 - link
The non-GPU memory market is best described as TBD.
For APUs it makes a ton of sense, again due to the GPU component. But for pure CPUs? The cost/benefit ratio isn't nearly as high. CPUs aren't nearly as bandwidth-starved, thanks in part to some very well-engineered caches.
PPalmgren - Wednesday, May 20, 2015 - link
There's something that concerns me with this: heat!
They push the benefits of a more compact card, but that also moves all the heat from the RAM right up next to the main core. The stacking factor of the RAM also scrunches its heat together, making it harder to dissipate.
The significant power reduction results in a significant heat reduction, but it still concerns me. Current coolers are designed to cover the RAM for a reason, and the GPUs currently get hot as hell. Will they be able to cool this combined setup reasonably?
phoenix_rizzen - Wednesday, May 20, 2015 - link
I see you missed the part of the article that discusses how the entire package, RAM and GPU cores, will be covered by a heat spreader (most likely along with some heat-transferring goo underneath to make everything level) that will make it easier to dissipate heat from all the chips together.
Similar to how Intel CPU packages (where there are multiple chips) used heat spreaders in the past.
vortmax2 - Wednesday, May 20, 2015 - link
Is this something that AMD will be able to license? Wondering if this could be a significant potential revenue stream for AMD.
Michael Bay - Wednesday, May 20, 2015 - link
So what's the best course of action? Wait for the second generation of technology?
wnordyke - Wednesday, May 20, 2015 - link
You will wait 1 year for the second generation. The second-generation chips will be a big improvement over the current chips (Pascal = 8x Maxwell; R400 = 4x R390).
Michael Bay - Wednesday, May 20, 2015 - link
4 times Maxwell seems nice. I'm in no hurry to upgrade.
akamateau - Thursday, May 28, 2015 - link
No, not at all. HBM doesn't need the depth of memory that DDR4 or GDDR5 does. DX12 performance is going to go through the roof.
HBM was designed to solve the GPU bottleneck. The electrical path latency improvement is at least one clock, not to mention the width of the pipes; the latency improvement will likely be around 50%. In and out. HBM will outperform ANYTHING that nVidia has out, using DX12.
Use DX11 and you cripple your GPU anyway. You can only get 10% of the DX12 performance out of your system.
So get Windows 10, enable DX12, and buy this card; by Christmas ALL games will be coming out DX12-capable, as Microsoft is supporting DX12 on Xbox.
Mat3 - Thursday, May 21, 2015 - link
What if you put the memory controllers and the ROPs on the memory stack's base layer? You'd save more area for the GPU and have less data traffic going from the memory to the GPU.
akamateau - Thursday, May 28, 2015 - link
That is what the interposer is for.
akamateau - Monday, June 8, 2015 - link
AMD is already doing that. Here is their patent:
Interposer having embedded memory controller circuitry
US 20140089609 A1
" For high-performance computing systems, it is desirable for the processor and memory modules to be located within close proximity for faster communication (high bandwidth). Packaging chips in closer proximity not only improves performance, but can also reduce the energy expended when communicating between the processor and memory. It would be desirable to utilize the large amount of "empty" silicon that is available in an interposer. "
And a little light reading:
“NoC Architectures for Silicon Interposer Systems: Why pay for more wires when you can get them (from your interposer) for free?” — Natalie Enright Jerger, Ajaykumar Kannan, Zimo Li (Edward S. Rogers Department of Electrical and Computer Engineering, University of Toronto) and Gabriel H. Loh (AMD Research, Advanced Micro Devices, Inc.)
http://www.eecg.toronto.edu/~enright/micro14-inter...
TheJian - Friday, May 22, 2015 - link
It's only an advantage if your product actually WINS the gaming benchmarks. Until then, there is NO advantage. And I'm not talking about winning benchmarks that are purely used to show off bandwidth (like 4K running at single-digit fps, etc.), when games are still well below 30fps minimums anyway. That is USELESS. If the game isn't playable at whatever settings you're benchmarking, that is NOT a victory. It's like saying something akin to "well, if we magically had a GPU today that COULD run 30fps at this massively stupid resolution for today's cards, company X would win"... LOL. You need to win at resolutions 95% of us are using, and in games we are playing (based on sales, hopefully, in most cases).
Nvidia's response to anything good that comes of rev1 HBM will be: we have more memory (the perception after years of built-up "more mem = better"), plus moving up to a 512-bit bus (from the current 384-bit on their cards) if memory bandwidth is any kind of issue for next-gen top cards. Yields are great on GDDR5, speeds can go up as shrinks occur, and as noted Nvidia is on 384-bit, leaving a lot of room for even more memory bandwidth if desired. AMD should have gone one more rev (at least) on GDDR5 to be cost-competitive, as more memory bandwidth (when GCN 1.2+ brings more effective bandwidth anyway) won't gain enough to make a difference vs. the price increase it will cause. They already have zero pricing power on APUs, CPUs and GPUs. This will make it worse for a brand new GPU.
What they needed to do was chop off the compute crap that most don't use (saving die size, or committing that area to more stuff we DO use in games) and improve drivers. Their last drivers were from December 2014 (for APU or GPU, I know, I'm running them!). The latest beta on their site is from 4/12. Are they so broke they can't even afford a new WHQL driver every 6 months?
No day-1 drivers for Witcher 3. Instead we get complaints that GameWorks hair cheats AMD, and complaints that CDPR rejected TressFX two months ago. Ummm, they should have asked for that the day they saw Witcher 3 wolves with GameWorks hair 2 years ago, not 8 weeks before the game's launch. Nvidia spent 2 years working with them on getting the hair right (and it only works great on the latest cards; it screws Kepler too, until possible driver fixes even for those), while AMD made the call to CDPR 2 months ago... LOL. How did they think that call would go this late into development, which is basically just spit and polish in the last months? HairWorks in this game is clearly optimized for Maxwell at the moment, but should improve over time for others. Turn down the AA on the hair if you really want a temp fix (edit the config file). I think the game was kind of unfinished at release, TBH. There are lots of issues looking around (not just HairWorks). Either way, AMD clearly needs to do the WORK necessary to keep up, instead of complaining about the other guy.
akamateau - Thursday, May 28, 2015 - link
You miss the whole point. DX12 solves Witcher 3.
HBM was designed to manage the high volume of drawcalls that DX12 enables.
ALL GPUs were crippled with DX11. DX11 is DEAD.
You can't render an object until you draw it. DX11 does not support multicore or multithreaded CPU processing of graphics; DX12 does.
With DX12 ALL CPU cores feed the GPU.
akamateau - Monday, June 8, 2015 - link
If AMD can put 4-8 gigs of HBM on a GPU, then they can do the same with CPUs as well as APUs. All of the patents that I am showing below reference 3D stacked memory. In fact, one interesting quote from the patents listed below is this:
"Packaging chips in closer proximity not only improves performance, but can also reduce the energy expended when communicating between the processor and memory. It would be desirable to utilize the large amount of "empty" silicon that is available in an interposer."
AMD has plans to fill that empty silicon with much more memory.
The point: REPLACE SYSTEM DYNAMIC RAM WITH ON-DIE HBM 2 OR 3!
Cutting the electrical path distance from 4-8 centimeters down to a few millimeters would be worth a couple of clocks of latency. If AMD is building HBM and HBM 2, then they are also building HBM 3 or more!
Imagine what 64GB of HBM could do for a massive server die such as Zen. The energy savings alone would be worth it, never mind the hugely reduced motherboard size from eliminating sockets and RAM packaging. The increased number of CPUs per blade or mobo also reduces costs, as servers can become much more dense.
Most folks now only run 4-8 gigs in their laptops or desktops. Eliminating DRAM and replacing it with HBM is a huge energy and mechanical savings as well as a staggering performance jump, and it destroys GDDR5. The process will be very mature in a year and costs will drop. Right now the retail cost of DRAM is about $10 per GB; subtract packaging and channel costs and that drops to $5 or less. Adding 4-8 GB of HBM has a very cheap material cost; the main expense is likely the process, testing and yields. Balance that against the energy savings and mobo real-estate savings, and HBM replacing system DRAM becomes even more likely, even before the massive leap in performance is counted as an added benefit.
The physical cost savings are quite likely equivalent to the added process cost, since Fiji will likely be released at a very competitive price point.
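A quick back-of-the-envelope on that cost argument, using only the commenter's own figures (roughly $5/GB effective DRAM cost after packaging and channel); purely illustrative, not real BOM data.

```python
# Material-cost offset from replacing system DRAM with on-package HBM,
# using the commenter's ~$5/GB effective DRAM cost estimate (not real BOM data).
EFFECTIVE_DRAM_COST_PER_GB = 5.0   # $/GB after packaging/channel, per the comment

for replaced_gb in (4, 8):
    offset = replaced_gb * EFFECTIVE_DRAM_COST_PER_GB
    print(f"Replacing {replaced_gb} GB of DRAM offsets roughly ${offset:.0f} of added process/test cost")
# ~$20-40 per system, before counting board area, socket and power savings.
```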
AMD is planning on replacing system DRAM with stacked HBM. Here are the patents. They were all published last year and this year with the same inventor, Gabriel H. Loh, and the assignee is of course AMD.
Stacked memory device with metadata management
WO 2014025676 A1
"Memory bandwidth and latency are significant performance bottlenecks in many processing systems. These performance factors may be improved to a degree through the use of stacked, or three-dimensional (3D), memory, which provides increased bandwidth and reduced intra-device latency through the use of through-silicon vias (TSVs) to interconnect multiple stacked layers of memory. However, system memory and other large-scale memory typically are implemented as separate from the other components of the system. A system implementing 3D stacked memory therefore can continue to be bandwidth-limited due to the bandwidth of the interconnect connecting the 3D stacked memory to the other components and latency-limited due to the propagation delay of the signaling traversing the relatively-long interconnect and the handshaking process needed to conduct such signaling. The inter-device bandwidth and inter-device latency have a particular impact on processing efficiency and power consumption of the system when a performed task requires multiple accesses to the 3D stacked memory as each access requires a back-and-forth communication between the 3D stacked memory and thus the inter-device bandwidth and latency penalties are incurred twice for each access."
Interposer having embedded memory controller circuitry
US 20140089609 A1
" For high-performance computing systems, it is desirable for the processor and memory modules to be located within close proximity for faster communication (high bandwidth). Packaging chips in closer proximity not only improves performance, but can also reduce the energy expended when communicating between the processor and memory. It would be desirable to utilize the large amount of "empty" silicon that is available in an interposer. "
Die-stacked memory device with reconfigurable logic
US 8922243 B2
"Memory system performance enhancements conventionally are implemented in hard-coded silicon in system components separate from the memory, such as in processor dies and chipset dies. This hard-coded approach limits system flexibility as the implementation of additional or different memory performance features requires redesigning the logic, which design costs and production costs, as well as limits the broad mass-market appeal of the resulting component. Some system designers attempt to introduce flexibility into processing systems by incorporating a separate reconfigurable chip (e.g., a commercially-available FPGA) in the system design. However, this approach increases the cost, complexity, and size of the system as the system-level design must accommodate for the additional chip. Moreover, this approach relies on the board-level or system-level links to the memory, and thus the separate reconfigurable chip's access to the memory may be limited by the bandwidth available on these links."
Hybrid cache
US 20140181387 A1
"Die-stacking technology enables multiple layers of Dynamic Random Access Memory (DRAM) to be integrated with single or multicore processors. Die-stacking technologies provide a way to tightly integrate multiple disparate silicon die with high-bandwidth, low-latency interconnects. The implementation could involve vertical stacking as illustrated in FIG. 1A, in which a plurality of DRAM layers 100 are stacked above a multicore processor 102. Alternately, as illustrated in FIG. 1B, a horizontal stacking of the DRAM 100 and the processor 102 can be achieved on an interposer 104. In either case the processor 102 (or each core thereof) is provided with a high bandwidth, low-latency path to the stacked memory 100.
Computer systems typically include a processing unit, a main memory and one or more cache memories. A cache memory is a high-speed memory that acts as a buffer between the processor and the main memory. Although smaller than the main memory, the cache memory typically has appreciably faster access time than the main memory. Memory subsystem performance can be increased by storing the most commonly used data in smaller but faster cache memories."
Partitionable data bus
US 20150026511 A1
"Die-stacked memory devices can be combined with one or more processing units (e.g., Central Processing Units (CPUs), Graphics Processing Units (GPUs), and Accelerated Processing Units (APUs)) in the same electronics package. A characteristic of this type of package is that it can include, for example, over 1000 data connections (e.g., pins) between the one or more processing units and the die-stacked memory device. This high number of data connections is significantly greater than data connections associated with off-chip memory devices, which typically have 32 or 64 data connections."
Non-uniform memory-aware cache management
US 20120311269 A1
"Computer systems may include different instances and/or kinds of main memory storage with different performance characteristics. For example, a given microprocessor may be able to access memory that is integrated directly on top of the processor (e.g., 3D stacked memory integration), interposer-based integrated memory, multi-chip module (MCM) memory, conventional main memory on a motherboard, and/or other types of memory. In different systems, such system memories may be connected directly to a processing chip, associated with other chips in a multi-socket system, and/or coupled to the processor in other configurations.
Because different memories may be implemented with different technologies and/or in different places in the system, a given processor may experience different performance characteristics (e.g., latency, bandwidth, power consumption, etc.) when accessing different memories. For example, a processor may be able to access a portion of memory that is integrated onto that processor using stacked dynamic random access memory (DRAM) technology with less latency and/or more bandwidth than it may a different portion of memory that is located off-chip (e.g., on the motherboard). As used herein, a performance characteristic refers to any observable performance measure of executing a memory access operation."
“NoC Architectures for Silicon Interposer Systems: Why pay for more wires when you can get them (from your interposer) for free?” — Natalie Enright Jerger, Ajaykumar Kannan, Zimo Li (Edward S. Rogers Department of Electrical and Computer Engineering, University of Toronto) and Gabriel H. Loh (AMD Research, Advanced Micro Devices, Inc.)
http://www.eecg.toronto.edu/~enright/micro14-inter...
“3D-Stacked Memory Architectures for Multi-Core Processors” — Gabriel H. Loh (Georgia Institute of Technology, College of Computing)
http://ag-rs-www.informatik.uni-kl.de/publications...
“Efficiently Enabling Conventional Block Sizes for Very Large Die-stacked DRAM Caches” — Gabriel H. Loh (AMD Research, Advanced Micro Devices, Inc.) and Mark D. Hill (Department of Computer Sciences, University of Wisconsin – Madison)
http://research.cs.wisc.edu/multifacet/papers/micr...
All of this adds up to HBM being placed on-die as a replacement for, or maybe a supplement to, system memory. But why have system DRAM if you can build much wider-bandwidth memory closer to the CPU, on-die? Unless of course you build socketed HBM DRAM and a completely new system memory bus to feed it.
Replacing system DRAM with on-die HBM has the same benefits for the performance and energy demand of the system as it has for GPUs. It also makes for smaller motherboards, no memory sockets and no memory packaging.
Of course this is all speculation. But it also makes sense.
amilayajr - Tuesday, June 16, 2015 - link
With HBM in mind, does AMD hold the patent for this? Is Nvidia just going to get to use HBM for free? Anyone care to elaborate? Because if Nvidia gets to use it for free, then that's really funny from AMD's side, considering they are the ones who researched and developed it. Am I making sense?