
  • brucethemoose - Tuesday, March 22, 2022 - link

    Those "Dual Gen 5 CPUs" in HGX can't be Nvidia Grace-based CPUs, can they? Or maybe CPUs from some other ARM vendor?

    Nvidia would probably bring that up, if that were the case.
  • ballsystemlord - Tuesday, March 22, 2022 - link

    They could also be a RISC-V variant, which an Nvidia engineer already talked about using in their products: https://www.youtube.com/watch?v=gg1lISJfJI0
  • mode_13h - Thursday, March 24, 2022 - link

    No, if they were anywhere remotely close to having competitive RISC-V cores, it would be well-known. Second obvious reason is that their customers aren't on RISC-V, and it's a bad move to get too far ahead of your customers. Finally, it runs completely counter to their attempt to buy ARM, which only ended just a couple months ago.

    Now that the ARM acquisition fell through, perhaps they're a little more circumspect about tying their future prospects to the ARM ISA, but they still can't afford to get ahead of their customers. For Nvidia, it's going to be x86 and ARM, for the foreseeable future.
  • Ryan Smith - Tuesday, March 22, 2022 - link

    Grace won't be shipping until 2023. So it won't be Grace, as DGX will be shipping in Q3'22.
  • Yojimbo - Tuesday, March 22, 2022 - link

    The obvious choice is Sapphire Rapids, which is actually planned to be available around the same time as GH100 is planned to be available. There is also Power10, but that's very unlikely.
  • eSyr - Tuesday, March 22, 2022 - link

    Either SPR or Genoa; POWER10 is out of the question, as Nvidia has stated that it is an x86 CPU[1]. They're probably hedging their bets at this point, or it's just that neither has been officially announced for 3Q2022.

    [1] https://cdn.wccftech.com/wp-content/uploads/2022/0...
  • Yojimbo - Tuesday, March 22, 2022 - link

    I don't believe Genoa will be out in time. So it must be Sapphire Rapids.
  • Mike Bruzzone - Thursday, March 24, 2022 - link

    Genoa has been shipping since Q3 2021, and Sapphire Rapids is shipping as well, both in risk quantities; they had to be, for DDR5 in-system validation. AMD is losing its TSMC 7 nm cost : price advantage, having to accommodate the TSMC foundry markup in its pricing relative to Intel's in-house SF10/x manufacturing cost : price advantage. With Milan on its late-market end run and Ice Lake rejected by the late market, Genoa and Sapphire Rapids are definitely available to complement use sites today. mb
  • Jon Tseng - Tuesday, March 22, 2022 - link

    Yes, but can it run Crysis at 32K??

    Sorry. Had to ask. :-x
  • zamroni - Tuesday, March 22, 2022 - link

    cyberpunk2077 4K-RT: excuse me, who are you?
  • p1esk - Wednesday, March 23, 2022 - link

    No, but it can play Crysis
  • Assimilator87 - Wednesday, March 23, 2022 - link

    If H100 is anything like the A100, it actually couldn't play Crysis at all, as it doesn't support DirectX.
  • p1esk - Thursday, March 24, 2022 - link

    I bet *you* don't support DirectX, yet you can play Crysis :)
  • mode_13h - Thursday, March 24, 2022 - link

    No, it probably can't run Crysis. It probably has no graphics units, such as texture, raster, and tessellation engines.
  • Ryan Smith - Thursday, March 24, 2022 - link

    2 of the TPCs are graphics enabled. However there are no RT cores, so ray tracing is not available.
  • mode_13h - Friday, March 25, 2022 - link

    Thanks for the correction.

    Details here: https://developer.nvidia.com/blog/nvidia-hopper-ar...
  • WestPole - Friday, March 25, 2022 - link

    Wow!! Now that brings back memories!! 😁
  • Doug_S - Tuesday, March 22, 2022 - link

    What's the purpose behind shipping it with 6x 16GB stacks of HBM3, but disabling one?
  • Ryan Smith - Tuesday, March 22, 2022 - link

    Yield. You can't fully test the chip until you put it together. This allows for one bad memory controller (or one bad HBM3 joining) without flunking the whole chip.
  • Infy2 - Tuesday, March 22, 2022 - link

    These will be perfect for mining Ethereum. I wonder what the hash rate will be?
  • brucethemoose - Tuesday, March 22, 2022 - link

    I know this is sarcasm, but price aside, it really does have a whole bunch of die space wasted on stuff Ethereum miners don't need.
  • timecop1818 - Tuesday, March 22, 2022 - link

    I hope you choke on your own saliva while you sleep
  • Silver5urfer - Tuesday, March 22, 2022 - link

    This GPU packs a lot of punch for sure. 700W on TSMC 4N means it's not a simple die shrink of any kind. Also, the SXM platform is pushed to its peak, which I didn't know was possible at all. AMD's Instinct SXM card was 500W I guess; I can't recall the name.

    Moving on to the vector FP compute specs, it looks like it's 3X faster than Ampere. Since these are HPC parts, if we extrapolate to mainstream RTX gaming cards, we're looking at, at the very least, an RTX 4090 / GH102 at 600W with 60+ TFLOPS (worst case) vs the RTX 3090's 36 TFLOPS. That's clearly double, but after Ampere, Nvidia's FP calculation math changed, so linearity probably shouldn't be expected; maybe the card could be 1.5X faster than a 3090 in raster. For RT I have no idea what the measuring scale is; these HPC cards are too much about RT/Tensor, so it's harder to compare based on how many SMs they have and the whole thing. Plus the leaks were wrong, 100Bn+ transistor count lmao; here we are at a mere 80B for a fat, beefy HPC card.

    All in all, TSMC 5N for us is going to be super expensive, with high power consumption, but the performance might be very, very high. Also, I'm more interested in RT performance than the garbage upscalers built on Temporal Anti-Aliasing; TAA ruins all games with blur at native resolution unless you use a mod or hex edit, and many people say upscaling is good only because of the blatant exclusion of sharpness sliders in DLSS and FSR. Nvidia axing NVLink for PC is a big kick too; the last GPUs to have it are the 3090 and maybe the 3090 Ti. Wish we got HBM for the cash we pay, like damn $1000+, but nope, we are stuck with G6X.

    RDNA3 with MCM is going to heat up the battle extensively for sure. The Radeon 7000 series is going to be a massive raster beast. Plus this is giving me a glimpse of AMD's Zen 4; it's going to steamroll, especially with PCIe 5.0 on all GPUs, NVMe, I/O and CPUs across the three brands.

    Also, on a side note, Nvidia's Vulkan drivers are a complete pile of shit vs AMD's, a shame really. Look at the Proton and DXVK updates that Steam is putting in; with a good API layer and compatibility, I will jump ship from Windows 7 / 10 LTSC to Linux. I can't stand the garbage OS that Windows 11 is, with its performance issues, spyware nonsense, and ugly, inefficient UX and features.
  • Silver5urfer - Tuesday, March 22, 2022 - link

    "blatant exclusion of sharpness sliders in the DLSS and FSR"., a mistake I missed adding Native vs DLSS and FSR since Natively games do not have any sharpness for TXAA and it's by default blurred out you see in games which have TXAA disabled vs enabled.
  • Rudde - Tuesday, March 22, 2022 - link

    As noted in the article, the PCIe version of GH100 is 350W. A GH102 would probably be less than or equal to that.
  • Kangal - Wednesday, March 23, 2022 - link

    I wonder what this means for RTX-4000 series.
    Will Nvidia shift from the Samsung 8nm node, to something more competitive like a TSMC 6nm process? Because if they stick with Samsung, they will have to use their 4nm node to stay competitive. AMD will be sticking with TSMC for the RX-7000 series and moving up onto their 5nm process.

    By the time they release, it will basically be 2023, and we will already have Samsung "3nm" and, even better, TSMC 4nm shipped in consumer products. Not sure about Intel's foundry, though.
  • CiccioB - Thursday, March 24, 2022 - link

    You can get the best PP you want if you pay enough.
    The problem is how many chips you manage to produce with it; if you have high costs and small production, your GPUs will be best only on paper, just smoke and mirrors for fanboys who will go around the forums bothering everyone about how good, cool, efficient, eco-friendly and beautiful they are, but which nobody is ever going to see for real.
    Or you'll have to sell a lung to get one.
  • mode_13h - Thursday, March 24, 2022 - link

    > AMD's Instinct SXM card was 500W I guess; I can't recall the name.

    MI250X quotes 560 W in OAM form factor. OAM is the Open Compute Project platform's standard for accelerator modules.

    https://www.anandtech.com/show/17054/amd-announces...
  • Silver5urfer - Friday, March 25, 2022 - link

    Thanks I forgot that one completely. Looks like AMD's next version of OAM will be even more insane.
  • Mike Bruzzone - Monday, March 28, 2022 - link

    I suspect 4N is the frequency-optimized version of 5; we'll find out soon enough . . . mb
  • Kevin G - Tuesday, March 22, 2022 - link

    "As PCIe is still used for host-to-GPU communications (until Grace is ready, at least)"

    It is worth pointing out that GV100 did support native NVLink to POWER8 processors from IBM. nVidia has partnered with other vendors on the CPU front when high performance and bandwidth are necessary. Dunno why the IBM and nVidia relationship fell apart, but IBM is going in a significantly different direction in terms of system design, with its flexible memory/IO topologies.

    "The external NVLInk Switch allows for up to 256 GPUs to be connected together within a single domain, which works out to 32 8-way GPU nodes. "

    How many switches are necessary to reach that 256 GPU figure? The current sixteen-A100 topologies generally use eight switches, though half of them are mainly there to propagate the signaling through the backplane.

    A 300% increase in performance for 75% more power is still an improvement in performance/watt. This does make me wonder how configurable clocks and power consumption will be on the SXM5 modules, as that'll be a huge increase in power consumption per system frame. Being able to run full racks of these systems at full load is going to require new infrastructure in many instances. At that density, liquid cooling is also going to become a requirement. I do see some demand in the HPC/AI sector for more 'drop in' replacements in terms of power consumption, even if they're not as performant as the 700W versions.
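
    To make that arithmetic explicit, a quick sanity check in Python (the 3x/4x readings of "300% increase" and the 400 W A100 SXM baseline are my assumptions, not spec-sheet numbers):

    ```python
    # Quick sanity check of the perf-per-watt claim above.
    # Assumptions (mine): A100 SXM baseline normalized to 1.0x perf at 400 W,
    # H100 SXM5 at 700 W (+75%), and "300% increase" read as either 3x or 4x.
    baseline_perf, baseline_power = 1.0, 400.0
    new_power = 700.0

    for new_perf in (3.0, 4.0):
        old_eff = baseline_perf / baseline_power
        new_eff = new_perf / new_power
        print(f"{new_perf:.0f}x perf at {new_power:.0f} W -> "
              f"perf/W improves by {new_eff / old_eff:.2f}x")
    # ~1.71x better perf/W at 3x, ~2.29x at 4x: better, but far from linear.
    ```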

    I'm also surprised that there isn't a version with the entire 6144-bit-wide memory bus enabled. Even for A100 I'm perplexed as to why this didn't happen, for memory bandwidth and memory capacity reasons. Are packaging yields really that bad?
  • Ryan Smith - Tuesday, March 22, 2022 - link

    "How many switches are necessary to reach that 256 GPU figure? The current sixteen A100 topologies generall use eight switches, though half of them are mainly to propagate the signaling through the backplane."

    NVIDIA's own documentation is less than clear on this point. The Hopper whitepaper says: "a total of 128 NVLink ports in a single 1 RU, 32-cage NVLink Switch"

    But looking at their diagrams (which are admittedly mock-ups), it looks like a full 32 node configuration uses 18 NVLink Switches.

    A single 1U NVLink Switch offers 32 ports.

    Which is not to be confused with the NVSwitch chips for on-board switching. NVIDIA's suggested topology there is 4 NVSwitch chips for an 8-way GPU configuration.
  • spikebike - Wednesday, March 23, 2022 - link

    Makes sense. Each node has 8 GPUs, so 32 nodes have 256 GPUs. If each H100 has 18 links, there's a switch (or bus) inside the node, and then you run one cable to each of 18 external switches. Such networks are pretty common with InfiniBand, which Mellanox builds, and Mellanox was purchased by Nvidia. The common Mellanox standard a few years ago was 200 Gbit HDR, which is 25 GB/sec per link, or 50 GB/sec per link bidirectionally, the same numbers NVLink has.
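
    A quick sketch of that port math under those assumptions (the per-node NVSwitch layer is glossed over here):

    ```python
    # Back-of-the-envelope check of the 256-GPU NVLink domain described above.
    # Assumptions: 8 GPUs per node, 32 nodes, 18 external 1U NVLink Switches,
    # 32 ports per switch, and one cable from each node to each external switch.
    gpus_per_node = 8
    nodes = 32
    external_switches = 18
    ports_per_switch = 32

    total_gpus = gpus_per_node * nodes                    # 256 GPUs in the domain
    cables = nodes * external_switches                    # 576 node-to-switch cables
    switch_ports = external_switches * ports_per_switch   # 576 ports available

    print(f"GPUs: {total_gpus}, cables: {cables}, switch ports: {switch_ports}")
    assert cables == switch_ports  # every port on every external switch is used
    ```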
  • Mike Bruzzone - Monday, March 28, 2022 - link

    switching, got it, thank you. mb
  • mode_13h - Thursday, March 24, 2022 - link

    > Dunno why the IBM and nVidia relationship fell apart

    I don't think that was the issue. I think the reason POWER8 supported NVLink was a couple of big-ticket supercomputer contracts. With subsequent machines opting to use AMD and Intel CPUs instead, there was no longer any impetus for IBM to add NVLink support to its newer CPUs.

    If IBM had wanted to add NVLink to newer CPUs, I'm pretty sure Nvidia would've let them. They already paid a high price to port their entire CUDA stack to POWER.

    > I'm also surprised that there isn't a version with the entire 6144 bit wide memory bus version.

    Perhaps Nvidia saves those for really "special" customers with extra-deep pockets. Maybe there just aren't enough of those golden units to be worth publicly announcing.
  • Cooe - Tuesday, March 22, 2022 - link

    Outside of Nvidia's traditional tensor op & AI/ML stronghold this looks absolutely PATHETIC compared to AMD's MI-250X.... 700W for just 30TFLOPS standard FP64??? (One THIRD of the almost 100TFLOPS from the MI-250X, and w/ higher power draw to boot!). Are you KIDDING ME??? People doing serious machine learning will buy Nvidia like they always have, but basically ANYONE in the HPC market is going to take one look at Hopper and say "Yeah.... That's freaking stupid." It's like Nvidia never wants to be in a major supercomputer ever again...
  • Cooe - Tuesday, March 22, 2022 - link

    *That near 100TFLOPs figure for MI-250X is actually matrix FP64 so it's really 60TFLOPS vs 96TFLOPS, but that's still an absolutely MASSIVE gap for the Nvidia part pulling +200W more power!!! Basically half the performance for like +1/3rd more power...
  • mode_13h - Thursday, March 24, 2022 - link

    > Basically half the performance for like +1/3rd more power...

    That's awfully fuzzy math, for someone banging on about HPC. MI250X offers about 59.7% more fp64 vector performance and 59.5% more fp64 matrix performance. Not *double*.

    As for power, the H100's 700 W is 25% more than the MI250X's 560 W.

    Now, if you want to talk efficiency, then we get 99.6% and 99.4% more perf/W at fp64 vector and matrix, respectively. However, that presumes customers will run either at their max rated speeds, which power-sensitive customers are unlikely to do.
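
    For anyone who wants to re-run the arithmetic, a small sketch using only the spec-sheet numbers quoted above:

    ```python
    # Reproducing the ratios above from the quoted spec-sheet numbers
    # (theoretical peaks only, nothing measured).
    mi250x = {"vector": 47.9, "matrix": 95.7, "power_w": 560.0}  # fp64 TFLOPS
    h100   = {"vector": 30.0, "matrix": 60.0, "power_w": 700.0}  # fp64 TFLOPS

    for key in ("vector", "matrix"):
        perf_ratio = mi250x[key] / h100[key]
        eff_ratio = (mi250x[key] / mi250x["power_w"]) / (h100[key] / h100["power_w"])
        print(f"fp64 {key}: MI250X has {perf_ratio - 1:.1%} more TFLOPS, "
              f"{eff_ratio - 1:.1%} more TFLOPS/W")

    print(f"H100 power vs MI250X: {h100['power_w'] / mi250x['power_w'] - 1:.0%} more")
    # fp64 vector: ~59.7% more perf, ~99.6% more perf/W
    # fp64 matrix: ~59.5% more perf, ~99.4% more perf/W
    ```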
  • cake_lover - Thursday, March 24, 2022 - link

    AMD's numbers are a fairytale. Their quoted 95 TFLOPS in fp64 cannot be maintained at 560 watts, and the GPU ramps down its clocks when you try. If you look at the small print on AMD's marketing you can see this for yourself: HPL efficiency is only at ~45% of peak.

    Moreover, quoted max TDP does not equal max power. The only way to compare the power efficiency of two processors is to look at the power consumed when running a given workload.

    A good vetted source of HPC fp64 processor efficiency is the Green500. If AMD's power efficiency is what they claim it to be in their slideware, then it will show up there.

    https://www.top500.org/lists/green500/
  • mode_13h - Friday, March 25, 2022 - link

    > AMD's numbers are a fairytale. Their quoted 95 TFLOPS in fp64 cannot
    > be maintained at 560 watts, and the GPU ramps down its clocks when you try.

    I think both AMD and Nvidia are guilty of pushing specs based on boost clocks.

    > the small print on AMD's marketing ...: HPL efficiency is only at ~45% of peak.

    That's yet again different. The numbers on *both* AMD and Nvidia's spec sheets are theoretical. For actual benchmark results, there's always a gap with such theoretical numbers.

    > Moreover, quoted max TDP does not equal max power.

    If you're concerned about *sustained* performance, then TDP should be your number.

    > The only way to compare the power efficiency of two processors is
    > to look at the power consumed when running a given workload.

    Well, yes. We'd like to actually benchmark these things, but most of us cannot. Especially when they haven't even started shipping, yet.
  • CiccioB - Thursday, March 24, 2022 - link

    Good try, as if AMD and Nvidia GPUs are going to saturate their TDPs while using *only* FP64 calculations.
    Oh, well, yes, AMD's GPU does, ahahaha.

    We'll have a look at real benchmarks when this monster is out.
  • Qasar - Thursday, March 24, 2022 - link

    Um, have you seen the rumblings about how much power the 3090 Ti could be using? It's known that you hate AMD, but come on, CiccioB, even you can't be that biased.
  • CiccioB - Friday, March 25, 2022 - link

    What does the 3090 Ti have to do with this discussion of FP64 efficiency, among all the other calculation units that consumer GPUs do not have?
    Do you know that HPC and gaming GPUs belong to different markets, with different purposes, aimed at different customers with different needs?

    And I'm not against AMD, I'm against AMD's stupid claims. My next build could be an AMD one if Zen 4 comes out better than Meteor Lake; I do not have prejudices.
  • CiccioB - Friday, March 25, 2022 - link

    * AMD fanboys' stupid claims

    Looking at how you answered, trying to throw shit at an Nvidia product at random, I suspect I am talking with one of them.
  • Qasar - Friday, March 25, 2022 - link

    " I do not have prejudices "
    what a pile of bs right there. based on all of your previous posts, you do so have have prejudices, amd its against amd, this is fact
    "  AMD fanboys' stupid claims "
    just like all of you stupid intel and nvidia claims ? and your bias against amd claims.

    My point was that Nvidia seems to be using more and more power for its video cards, while AMD does not. Based on that, it's probably a safe bet the H100 will be the same. It's rumored that Nvidia's next cards will use even more power.
  • CiccioB - Friday, March 25, 2022 - link

    AMD does not what?
    Even AMD has a 350W consumer video card, and its MI250 has a TDP of 560W.
    The next 16-core Zen 4, done on a 5nm PP, is going to have a 170W TDP, 60% more than the current 5950X.
    Everyone is raising TDPs due to PPs not scaling as they did in the past.
    What are you talking about?
    You are just trying to throw shit at Nvidia by bringing up things unrelated to this thread.
    You just showed you are a clueless AMD fanboy.
  • Qasar - Friday, March 25, 2022 - link

    And you are not an Intel and Nvidia fanboy? Come on; as I said, MOST of your OWN posts bash AMD in some way or another. Before you call someone a fanboy, admit to being one yourself.

    AMD may be raising their TDPs, but at least they stay close to them, compared to Intel and Nvidia.
    "The next 16-core Zen 4, done on a 5nm PP, is going to have a 170W TDP, 60% more than the current 5950X." And how much power do Intel's CPUs actually use compared to what they state? In some cases, TWICE.

    Whatever, fanboy; go praise your Intel and Nvidia gods.
  • Abe Dillon - Tuesday, March 22, 2022 - link

    They're different cards with different design goals and trade-offs. Nobody is kidding you. Calm down.

    It's not a huge secret that Nvidia has been targeting the ML market for about a decade now and it's worked out pretty well for them. The last thing we need is more fanboi "professionals" in the HPC space. Use what works for you. If Nvidia doesn't offer what you need, go somewhere else. Simple as that.

    Also, ML is a sub-market of the HPC market, unless you have some bizarre definition of "HPC"...
  • AdrianBc - Wednesday, March 23, 2022 - link

    I would say that defining either ML or HPC as a sub-market of the other would be a bizarre definition.

    These two application domains are very different. HPC needs high precision and very good reliability, while ML is content with low precision, and a few random errors have little impact.

    Because the requirements differ, it is possible to design devices that are very good for 1 of the 2 domains, while being bad for the other.

    Moreover, not only is the hardware different, but the users are not the same either. It is quite rare for someone to be professionally interested in both ML and HPC.
  • mode_13h - Thursday, March 24, 2022 - link

    > I would say that defining either ML or HPC as a sub-market of the other
    > would be a bizarre definition.

    I would've agreed with you, but apparently an increasing number of HPC applications are starting to embrace deep learning techniques to improve their solution quality, reduce computation times, or both.

    https://www.nextplatform.com/2018/09/26/deep-learn...

    Otherwise, you'd expect AMD and Nvidia would probably make pure fp64 accelerators for HPC and chips with no fp64 for machine learning. In fact, I could still see them doing the latter. Or maybe they go forward with a full bifurcation and just sell interested HPC customers on using a mix of the different accelerator types.
  • mode_13h - Thursday, March 24, 2022 - link

    According to this, AMD's MI250X delivers 47.9 TFLOPS of fp64 vector and 95.7 of fp64 matrix perf in 560 W.

    https://www.anandtech.com/show/17054/amd-announces...

    Compare that to 30 fp64 vector TFLOPS and 60 fp64 matrix TFLOPS in the H100 and they actually had a competitive solution... if they were assuming AMD was going to keep their micro-architecture at 32-bit.

    Remember AMD doubled everything to 64-bit, in order to deliver those numbers. So, it was probably a case of Nvidia just adding enough fp64 horsepower to keep ahead of where they *expected* AMD to be.

    Still, it doesn't change the fact that AMD comfortably held its lead, on the fp64 front.
  • DaGunzi - Tuesday, March 22, 2022 - link

    > Though putting PCIe 5.0 to good use is going to require a host CPU with PCIe 5.0 support, which isn’t something AMD or Intel are providing quite yet.

    I guess one could put one of the PCIe Cards on a 12th Gen Intel LGA 1700 System and enjoy PCIe 5.0 in the solitude of one GPU?
  • mode_13h - Thursday, March 24, 2022 - link

    Amazon's Graviton 3 has PCIe 5.0. Amazon claims they were already deploying it towards the end of last year.

    https://semianalysis.substack.com/p/amazon-gravito...

    That's not to say Nvidia is going to source CPUs from Amazon, just that it's apparently not so big a lift that someone like Ampere or even Marvell couldn't release an ARM CPU in time.

    That said, my bet is on Intel's Sapphire Rapids.
  • zamroni - Tuesday, March 22, 2022 - link

    Jensen has to release Hopper because AMD MI250 performance is way higher than GA100's.
  • CiccioB - Thursday, March 24, 2022 - link

    Yes, if it were not for the MI250, Nvidia would have skipped its traditional every-two-years presentation of a new architecture for the first time in Nvidia history and would have carried on with Ampere until Blackwell in 2024.
    Nice one!
  • ET - Wednesday, March 23, 2022 - link

    I dug through all the H100 material NVIDIA released to try to find something about the DPX instructions, because they sound quite interesting, but unfortunately there's nothing there. I'm guessing that NVIDIA prefers to keep this a secret until the product's release.

    It's nice that Ryan chose to compare to the A100 80GB, as a lot of NVIDIA docs compare to the 40GB version (which has lower speed HBM2) to make the new product look better.

    Still, looks like an impressive product and I'm looking forward to it.
  • mode_13h - Thursday, March 24, 2022 - link

    For that level of detail, you should keep checking here or in Nvidia's developer forums.

    https://docs.nvidia.com/cuda/cuda-c-programming-gu...

    However, the last update to their CUDA API docs is dated March 11th. Be patient, as Q3 is still a ways off.
  • back2future - Wednesday, March 23, 2022 - link

    "20 H100s can sustain the equivalent of the world's Internet traffic"
    That's ~15kW of TDP for 2022's total worldwide network data transfer (~120-170Tbps in 2019-2020, roughly 1/4 of installed network transfer capacity in 2020). And the naming scheme is best seen (as it has been for some time) in combinations like Grace Hopper :)
  • back2future - Wednesday, March 23, 2022 - link

    Well, rethinking that: who likes to name computers after humans?
    Back to technical terms, AFAIC2M, from this point of view.
  • nandnandnand - Wednesday, March 23, 2022 - link

    "20 H100s can sustain the equivalent of the world's Internet traffic"
    Sounds like a useless marketing factoid.
  • back2future - Wednesday, March 23, 2022 - link

    It's TB/s of network data traffic (~4000 EB/year, or ~130 TB/s, for global cable and mobile) compared to memory bandwidth. On those predicted 2022 numbers (Cisco), that would need 40-45 H100s (thus ~30kW of TDP). 20 devices would be comparable to Cisco's prediction for 2019.
    https://www.ieee802.org/3/ad_hoc/bwa2/public/calls...
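
    A rough sketch of where those device counts come from, assuming ~3 TB/s of HBM3 bandwidth per H100 (the exact figure depends on which H100 variant you take):

    ```python
    # Rough check of the "N H100s equal world internet traffic" comparison.
    # Assumptions: ~4,000 EB/year of global traffic (the Cisco-style projection
    # cited above), ~3 TB/s of HBM3 bandwidth per H100, and 700 W per module.
    SECONDS_PER_YEAR = 365 * 24 * 3600
    traffic_tb_per_s = 4000e6 / SECONDS_PER_YEAR   # 4,000 EB expressed in TB
    hbm3_tb_per_s = 3.0
    tdp_kw = 0.7

    devices = traffic_tb_per_s / hbm3_tb_per_s
    print(f"global traffic ~{traffic_tb_per_s:.0f} TB/s -> "
          f"~{devices:.0f} H100s -> ~{devices * tdp_kw:.0f} kW")
    # ~127 TB/s -> ~42 devices -> ~30 kW, in line with the 40-45 figure above.
    ```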
  • mode_13h - Thursday, March 24, 2022 - link

    > the world's Internet traffic

    That's about as meaningful as comparing your CPU's L3 cache bandwidth to the world's internet traffic. The internet is slower, precisely because it's more costly and energy-intensive to transmit data over long distances. Also, a lot more switching is needed.

    It reminds me of the days when the popular news media would try to explain storage capacity in terms of how many copies of some encyclopedia or even the entire Library of Congress it would hold. As it's not what people actually transmit or store, the comparison is rubbish.
  • back2future - Thursday, March 24, 2022 - link

    You don't like it.
    But it gives a picture, at least.
  • back2future - Thursday, March 24, 2022 - link

    Global data storage capacity might be comparable to ~6 zettabytes (ZB; hardware storage cost ~$225bn, 22.5tn₽ or ₴675bn) for 2020,

    while the 2020 volume of data created was maybe 40ZB (all data/information created in 2021 ~75ZB, or ~0.33MB/s for each person on planet Earth; another estimate from 2016 is ~1.7MB/(person*s), Northeastern University). 50% of all internet traffic is from mobile phones, and in 2021 ~73% of all traffic was video content.

    Around 2010, a ~60% share of data was stored on personal computing devices, while by ~2020 ~60% was stored in data centers (cloud); total www electricity consumption might be assumed to be ~10% of global electricity production, or about 2500TWh/a (maintaining 1GB of data requires ~5kWh of background energy, on a ~2020 primary energy mix that is 60-65% fossil).

    Meanwhile ~3M emails are sent each second, and *2/3* are declared spam (Internet Live Stats, 2021).

    (Sorry, this is no video, but the link above summarizes the sum of this, from the 'Cisco VNI Forecast update', 2019, PDF ~1.2MB, 64 pages, 19.2kB/pg.)
  • mode_13h - Thursday, March 24, 2022 - link

    > combinations like Grace Hopper :)

    I'm glad you like it, because it definitely annoys me. Not the "Hopper" part, but calling the CPU "Grace". But, it's not like I'm going to lose even a wink of sleep over it.
  • CiccioB - Thursday, March 24, 2022 - link

    Just out of curiosity, why is a GPU called "Hopper" good, while a CPU called "Grace" is so disturbing?
  • mode_13h - Thursday, March 24, 2022 - link

    Hopper is a surname, whereas Grace is either not, or at least it's styled to appear as if it's potentially not.

    Perhaps you're about to note that Grace also isn't a GPU, but I'd then counter that a naming scheme based on first names strikes me as rather silly. That convention would've had prior generations named like James, Alan, and André-Marie. Not terrible names for a CPU, but definitely not among the better ones.

    Next, consider that Grace is designed to be paired with many Hoppers. So, you'd have Grace + Hoppers or Graces + Hoppers as the standard configuration. When written accurately, it's not even the name of the person it's trying to form. And what if the successor to Hopper is introduced before the successor to Grace? Then, you'd have something like Grace + Lovelace or the converse would yield Ada + Hopper.

    Finally, it's just goofy. And seems maybe driving too hard to make a point. Dare I say even pandering?
  • nadim.kahwaji - Wednesday, March 23, 2022 - link

    Great, great article. I can confidently say that Ryan is back :) ....
  • mode_13h - Thursday, March 24, 2022 - link

    This is not a review. It's based on marketing material Nvidia pre-released to all major tech press, under embargo.

    That's not to say that Ryan did nothing, but it's not the same as original testing and analysis.
  • mode_13h - Thursday, March 24, 2022 - link

    I was just saying in the comments of the MI210 article that AMD messed up by sticking with PCIe 4.0 for Aldebaran. This proves it.

    What I'd *really* like to know, but we'll probably never find out, is what the peak-efficiency points are for the MI250X and the H100.

    I'd definitely like to know more about DPX. I really hope it trickles down to their consumer GPUs, to some degree.

    What Nvidia has done right that AMD missed (since RDNA) is making its consumer GPUs basically an inexpensive development platform for the high-performance versions. That and good developer support, even on gaming cards, is what propelled Nvidia's popularity with developers. AMD still doesn't seem to get it. Their hardware is going nowhere, if they alienate the development community by shutting out everyone without access to a $10k CDNA card.
  • mode_13h - Thursday, March 24, 2022 - link

    The other point I was going to make is that I wonder how much on-die memory they have. This seems to be the growing trend in deep learning & data-flow processors. GPUs are good at hiding latency, but it's even better not to have high latency, in the first place.

    Also, data movement + lookups in cache tag RAM are probably where Nvidia is burning a significant chunk of power.
  • CiccioB - Thursday, March 24, 2022 - link

    > Also, data movement + lookups in cache tag RAM are probably where Nvidia is burning a significant chunk of power.
    Until there will be the tech to put several GB of SRAM directly and cheaply on GPU die that kind of inefficiency has to be taken into account.
    All you can do meanwhile is try to re-use data and results as much as possible without moving them back and forth, and it seems Nvidia has been targeting this goal with some of the new features added to Hopper.
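
    To put a rough number on the re-use argument (illustrative only, not Nvidia's actual scheme): a blocked matrix multiply cuts off-chip traffic roughly in proportion to the tile size that fits in on-die memory.

    ```python
    # Illustrative only: approximate DRAM traffic for a naive vs. tiled N x N
    # matrix multiply with fp16 operands, to show why keeping working tiles
    # on-die cuts off-chip data movement.
    def dram_traffic_gb(n, tile, bytes_per_elem=2):
        naive = 2 * n**3 * bytes_per_elem               # every output re-reads a row and a column
        tiled = 2 * n**2 * (n / tile) * bytes_per_elem  # each input streamed ~n/tile times
        return naive / 1e9, tiled / 1e9

    n, tile = 8192, 128   # tile chosen so blocks of A and B fit in SRAM/shared memory
    naive_gb, tiled_gb = dram_traffic_gb(n, tile)
    print(f"naive: ~{naive_gb:.0f} GB, tiled: ~{tiled_gb:.0f} GB, "
          f"~{naive_gb / tiled_gb:.0f}x less off-chip traffic")
    ```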
  • mode_13h - Friday, March 25, 2022 - link

    > Until there will be the tech to put several GB of SRAM directly and
    > cheaply on GPU die that kind of inefficiency has to be taken into account.

    That's a false dichotomy. Nvidia can increase their SRAM substantially, without going all the way to GBs. And I was merely wondering how much they have, because it's an undeniable trend.

    FWIW, each SM in the A100 has "unified data cache and shared memory with a total size of 192 KB", with up to 164 KB of it being configurable as shared memory. So, that's about 20.25 MB of on-die SRAM of which up to 17.3 MB is directly-addressable. Relative to the amount of compute power it has, I think that's smaller than what competing AI silicon tends to feature.

    So far, the figures I've managed to find are 900 MB for a 1472-core GraphCore chip, Xilinx Versal AI Edge claiming 38 MB for a "mid-range" implementation, 300 MB for SambaNova's Cardinal SN10 RDU, 160 MB for Esperanto's 1k-core RISC-V chip, 144 MB for the Qualcomm AI100, and 40 GB for the Cerebras WSE-2. Obviously, Cerebras is not a fair comparison.

    Also, worth noting: DE Shaw's ANTON 3 features 66 MB of SRAM, but that's a purpose-built chip for molecular dynamics simulations.
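
    For reference, here is how the ~20 MB figure scales from the per-SM numbers, assuming the 108 SMs enabled on the shipping A100:

    ```python
    # How the ~20 MB figure above falls out of the per-SM numbers.
    # Assumption: 108 SMs enabled on the shipping A100.
    sms = 108
    l1_plus_shared_kb = 192    # unified data cache + shared memory per SM
    max_shared_kb = 164        # portion configurable as shared memory per SM

    print(f"total on-SM SRAM: {sms * l1_plus_shared_kb / 1024:.2f} MB")   # ~20.25 MB
    print(f"directly addressable: {sms * max_shared_kb / 1024:.2f} MB")   # ~17.3 MB
    ```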
  • mode_13h - Friday, March 25, 2022 - link

    Forgot to add that most/all of those are TSMC N7. I got a lot of them from the Hot Chips 2021 presentations. I didn't bother to look through the 2020 set.
  • CiccioB - Friday, March 25, 2022 - link

    The amount of data you can keep on-die helps greatly to decrease both access latency and also transfer power.
    That's a fact, not a "false dichotomy", as there is no dichotomy here.

    GH100 is an 800mm^2-and-beyond chip filled with every possible calculation unit, to be able to execute any sort of work thrown at it. The amount of cache it can hold inside the die is small, or you would have to sacrifice brute force, or drop some unit entirely, diminishing its flexibility.
    It has 60MB of cache as a whole, but even if it were 120MB or 320MB, it might still be too little.
    This monster can move 3TB of data in a second; you can see for yourself that those caches are ridiculously small, so there's no pretending that a lot of data doesn't still come and go to the HBM stack as a temporary place to be accessed again later.
    Probably putting something like a 3D cache on it, reaching several hundred MB of fast SRAM, would help a lot in lowering data movement outside the die.

    However, cache and SRAM suck energy by themselves, so you have to balance the energy spent keeping data near the cores against that spent transferring data far away to the HBM stack.
    I suppose Nvidia did its calculations, both in terms of performance, cost, thermal and probabilities, and did what they could to ease the data-transfer power consumption issue with the transistor budget they had. That does not mean that bigger caches (or an L3 cache on a separate 3D die) would not help such a data-swallowing monster.
  • CiccioB - Friday, March 25, 2022 - link

    You can see what Nvidia has done about this data-transfer problem directly in the description of their own architecture:
    * New thread block cluster feature enables programmatic control of locality at a granularity larger than a single thread block on a single SM. This extends the CUDA programming model by adding another level to the programming hierarchy to now include threads, thread blocks, thread block clusters, and grids. Clusters enable multiple thread blocks running concurrently across multiple SMs to synchronize and collaboratively fetch and exchange data.
    * Distributed shared memory allows direct SM-to-SM communications for loads, stores, and atomics across multiple SM shared memory blocks.
    * New asynchronous execution features include a new Tensor Memory Accelerator (TMA) unit that can transfer large blocks of data efficiently between global memory and shared memory. TMA also supports asynchronous copies between thread blocks in a cluster. There is also a new asynchronous transaction barrier for doing atomic data movement and synchronization.
    * 50 MB L2 cache architecture caches large portions of models and datasets for repeated access, reducing trips to HBM3.

    On the complete chip, the L2 cache amounts to 60 MB.
  • mode_13h - Saturday, March 26, 2022 - link

    > You may see what Nvidia has done for this data transfer problem

    These help them get more use out of what they have, but are ultimately limited in how much they can compensate for lack of quantity.

    > Tensor Memory Accelerator

    Doesn't appear to reduce data movement, and thus energy utilization. All it does is help the compute units stay busy doing actual computation. That's only tackling half the problem.
  • Mike Bruzzone - Monday, March 28, 2022 - link

    And where are the X and F variants on their L3 low-latency aims? I wonder what complement that hooks up to? mb
  • mode_13h - Saturday, March 26, 2022 - link

    > That's a fact, not a "false dichotomy", as there is no dichotomy here.

    You quoted me saying:

    me> data movement + lookups in cache tag RAM are probably where Nvidia
    me> is burning a significant chunk of power.

    and then claimed:

    you> Until there will be the tech to put several GB of SRAM directly and cheaply
    you> on GPU die that kind of inefficiency has to be taken into account.

    The false dichotomy is that there's no use trying to mitigate the cost of cache & data movement until you can put GB of RAM on die.

    I quoted numerous examples where (mostly) AI processors competing with Nvidia reached a different conclusion. Some went as far as to say they didn't even need HBM, due to the amount of on-die SRAM they had, instead using conventional DDR4 for bulk storage.

    If you know anything about the data-access patterns involved in neural net inferencing, this makes perfect sense. The weights tend to be much larger than the data propagating through the network. So, the more on-die storage you have, the less you need to page them in from DRAM, assuming you can do some form of batch-processing.
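
    A toy model of that amortization argument, with made-up numbers just to show the shape of it:

    ```python
    # Toy model: when weights don't fit on-die, they get streamed from DRAM
    # once per batch, so bigger batches spread that cost over more samples.
    def weight_mb_per_sample(weight_mb, batch_size):
        return weight_mb / batch_size

    weights_mb = 500   # hypothetical model bigger than any on-die SRAM
    for batch in (1, 8, 64):
        print(f"batch={batch:3d}: ~{weight_mb_per_sample(weights_mb, batch):6.1f} "
              f"MB of weight reads charged to each sample")
    ```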

    > GH100 is an 800mm^2-and-beyond chip filled with every possible calculation unit,
    > to be able to execute any sort of work thrown at it. The amount of cache it can
    > hold inside the die is small, or you would have to sacrifice brute force, or drop
    > some unit entirely, diminishing its flexibility.

    This is a silly argument. All of these chips are optimized to do heavy computation. Your only legit point is that maybe some of the H100's target workloads, besides AI, lack the same degree of locality. So, maybe they chose to hurt AI performance a little bit, in order to devote more die area to things like fp64, so they could better counter AMD on that front.

    > It has 60MB of cache as a whole

    Well, it's not ideal to have as cache, for the reasons I mentioned. The software is already custom-tuned for the GPU, so you might as well use directly-addressable SRAM and avoid the energy & latency penalty of the tag lookups.

    Now that I'm looking at Nvidia's blog, I see they have up to 26 MB of on-die "shared memory". The "up to" part is a reference that some amount of it could be used as either cache or directly-addressable, which makes it clear they understand the benefits. Otherwise, they'd just make it all cache.

    > This monster can move 3TB of data in a second

    That burns a lot of power. It's better if you can optimize your memory hierarchy and data access patterns to keep the most frequent accesses on-die. Some of those Hot Chips presentations I pored through to collect the SRAM tallies in my prior post mentioned far greater throughput; I think one mentioned 12 TB/sec, and GraphCore mentioned 62 TB/sec for accessing their SRAM contents.

    > those caches are ridiculously small, so there's no pretending that a lot of data
    > doesn't still come and go to the HBM stack as a temporary place to be accessed again later.

    900 MB is ridiculously small? Also, you don't seem to know anything about the data access patterns in question. When building high-performance hardware, it pays to know something about the software it's intended to run. Ideally, the hardware and software should be co-optimized.

    > Probably putting something like a 3D cache on it, reaching several hundred MB
    > of fast SRAM, would help a lot in lowering data movement outside the die.

    One of those I mentioned used 3D stacking, but I think GraphCore (the one with 896 MB) didn't. They say half the die is SRAM.

    > I suppose Nvidia did its calculations, both in terms of performance, cost, thermal

    GraphCore claims their entire quad-processor chassis uses 800-1200 W typical, 1500W peak

    > transistor budget

    GraphCore's is 59.6 B, which they say is TSMC's biggest N7 chip.

    I definitely think Nvidia generally knows what they're doing. That's why I expected more on-die SRAM. However, it's risky to assume that just because Nvidia did something one way, that it's optimal. Nvidia is tackling a broader problem set than most of their competitors in the AI market. This could finally start to hurt them.
  • Mike Bruzzone - Monday, March 28, 2022 - link

    mb
  • CiccioB - Monday, March 28, 2022 - link

    I think you have not understood a simple thing: the chips you have listed are specialized chips, designed to do a single thing or to work in a niche market, where maybe they can sell for tens of thousands of dollars each. Even AMD put some MB of cache on their gaming GPUs and got a big benefit, enough to reduce the bus width towards the RAM (reducing power consumption and complexity costs). But those are aimed at a special market and manage it with just a few tens of MB; used for anything that is not moving textures and pixels, they are terrible.
    GH100 is a GENERAL PURPOSE computing chip.
    If you understand this, you'll understand that it is also quite a big chip, at the reticle limit.
    Now, of all the space you have available, you have to choose whether it becomes cache or computing units.
    The more cache you put in, the better the chance that SOME work runs faster, but you surely limit the total chip performance, or even its flexibility if you decide to remove a type of unit as a whole.
    So GH100 cannot be compared to specialized chips. It's aimed at a different market.

    About the amount of cache: you tell me I do not know anything about data access patterns, but you clearly do not know anything about Nvidia GPU cache organization if you can't tell configurable L1 cache (which can be partially configured as cache or as a register file) from plain L2 cache.
    And I said that some hundreds of MB can be of some use to the chip. But you cannot put them on die, and you cannot have the chip suck 1500W of power for a 30% performance increment at best on some computational tasks.
    Look at what you want: the cache on Hopper is bigger than Ampere's and way bigger than AMD's Aldebaran chip. So Nvidia put what they could toward the problem here.

    We can't really teach Nvidia about the best choice of their architectures. You know, there are years of thorough simulations behind every choice, calculations of costs, power consumption and probability of being used, all compared to alternative choices. We know so little about the things they have chosen to put on the chip, let alone those they have not.
    If they chose to make it that way, it's because they thought (and simulated) that it best covers the market they are targeting.

    Cache is a "dumb" option, transparent to the SW, easy to make, easy to extend as needed/possible. Unfortunately it is also space and power hungry. Everyone would like to put hundreds of MB of it anywhere. But it's not a secret that you cannot unless your chip becoming big and expensive. And that's why everyone is studying these 3D lego bricks with behemoth buses capable of using separate cache as if it were on the same die with minimal extra use of energy and latency (but radically truncating costs, so you can compensate with quantity to the quality).
    Next generation technology will be mature enough to allow for 3D stacked chip of every kind. Cache too.
    We'll see how bit they become, but a thing is certain: they will increase.
  • mode_13h - Wednesday, March 30, 2022 - link

    > I think you have not understood a simple thing

    I think you didn't read my posts as carefully as I wrote them.

    > GH100 is a GENERAL PURPOSE computing chip.

    This point is not lost on me. Neither is the fact that Nvidia is counting on it being competitive on AI workloads. So, they cannot afford to lose too much ground to their up-and-coming purpose-built AI competition. This is why I expected to see them do more to keep up in terms of on-chip storage.

    > GH100 cannot be compared to specialized chips. It's aimed at a different market.

    It can and it will be. Someone doing AI workloads isn't going to give the H100 a pass because it has other features they don't need. They're going to compare it in terms of things like perf/$, perf/W, or simply perf/TCO on *their* workloads vs. the pure-bred competition. That's how the H100 is going to be judged.

    Similarly, someone doing HPC workloads that don't involve AI (which is probably still the majority of them) isn't going to care if it has AI acceleration capabilities they have no intention of ever using.

    This is the hazard of targeting too many markets with a single product. It's why AMD culled graphics hardware from their CDNA chips and it's probably why they + Nvidia are going to further bifurcate their big GPUs into separate AI- and HPC- oriented products.

    > I said that some hundreds of MB can be of some use to the chip.
    > But you cannot put them on die

    GraphCore did. They put 900 MB of SRAM on a single N7 die. They said it took up half the die. Are you not even reading my posts?

    Nvidia is using a smaller node and wouldn't have to use quite that much, of course.

    > and you cannot have the chip suck 1500W of power for a 30% performance
    > increment at best on some computational tasks.

    Where are you getting that number? AI processors use lots of SRAM because it provides a net power *savings*!! Again, because you don't seem to be reading my posts, I have to repeat that their entire 4-processor chassis uses "800-1200 W typical, 1500W peak". That's only 200 - 300 W per chip, with a 375 W max. But it's surely not that high, because other things in the chassis are using some of that power. So, maybe we could imagine 175 - 275, 350 W max (if not less), which is no more than *half* what the H100 is using!

    > way bigger than AMD's Aldebaran chip

    So what? AMD is almost a non-player in the AI market. Again, you're missing how the H100 will be judged. People evaluating it for AI will compare it against the best-in-class AI chips, not some other chip that they're not using and probably wouldn't even consider.

    > We can't really teach Nvidia about the best choice of their architectures.

    I'm not trying to. I'm just an interested observer.

    > there are years of thorough simulations behind every choice

    Yeah, but I already addressed this. You assume that because they're smart and they've been successful in this space, that they're making the right decisions. And if we accept as given that they need to balance the performance of the chip between HPC and AI, then maybe they did. The key question is whether that put them at too much of a disadvantage against pure-play AI chips.

    > Cache is a "dumb" option, transparent to the SW

    You're not reading me carefully enough. I didn't advocate for cache. Cache is indeed not a smart move. What they should prefer is more directly-addressable on-die memory.

    > it's no secret that you cannot, unless your chip becomes big and expensive.

    What is the H100, if not big and expensive?

    I don't really see the point of you replying to me, when I've already addressed most of the points you're making. Please save us both time and actually READ MY POST, before hitting "reply".
  • Mike Bruzzone - Monday, March 28, 2022 - link

    High bandwidth solid state memory sucks power less and even if not as dense as captive charge for the legacy cost who cares other than latency. mb
  • mode_13h - Monday, March 28, 2022 - link

    > High bandwidth solid state memory sucks power less

    Less than "what" is the key question, and using what metric.

    > who cares other than latency.

    Even HBM3 can't touch the potential bandwidth & latency advantages of on-die SRAM.

    And you'd do well to remember the true cost of latency. In order to hide latency, Nvidia needs SMT at least on a scale of like a dozen or so. That's not free. It means they have to duplicate their ISA state, which means lots more registers and datapaths, which burns lots more real estate.

    So, a far better approach than latency-hiding is just to have less latency, in the first place. And if you know your data access patterns and can co-optimize the software and hardware well enough, then maybe you can do just that. If you read through this site's extensive Hot Chips coverage, from the past couple years, you'll see that's exactly what Nvidia's AI-focused competitors have done.
  • Mike Bruzzone - Monday, March 28, 2022 - link

    So you're saying that, versus Nvidia GPUs, the replacement technologies are an FPGA SRAM fabric with the processor(s) on top? mb
  • mode_13h - Monday, March 28, 2022 - link

    Search out the Hot Chips presentations for the past couple years, and you can see what people are doing. IIRC, only one of them (Xilinx, of course) was FPGA-based.

    I'm not a huge fan of FPGAs for AI, but maybe that's just me. IMO, the problem is well enough understood that other approaches are more area- and power-efficient. So, if you don't have other reasons to use an FPGA, then you'd probably be better off with a different solution.

    That said, I'm willing to accept there could be some network and layer types that really don't work well on existing & upcoming silicon solutions. I'm not a deep learning expert, but I'm peripherally involved with it.
  • mode_13h - Monday, March 28, 2022 - link

    Oh, and a tip when reading the Hot Chips liveblogs on this site is that you can right-click and open the images in another tab to actually read them. They're much higher-resolution than how they appear inline.
  • Mike Bruzzone - Thursday, March 24, 2022 - link

    H100, and generally every new generation of processing device, is so far ahead of the software that application development tools have to chase it. That's part of a sustainable segment-stronghold embrace. It reinforces the utility value of back-generation parts, produced on depreciated processes, for years to come. Price elasticity of demand remains a function of time, and on the supply side, component designers and manufacturers now have more runway per product generation, at least for the good architectural examples that deliver 'stretch'.

    This stratifies the foundation of back-generation processors of all types, for better and for worse (future lock-in, in lieu of a replacement technology emerging), around application use for 'utility value', as software developers catch up on the application side and demonstrate return on investment.

    New generations now stack on top performance-wise, but cost : price-wise, wider adoption will be found in the back generation, while the mass commercial and technical markets wait for a proven whole product for their own application uses.

    Unless your business is the business of compute, H100 will not be relevant for years, and by then they'll be available on the secondary market on a more cost-effective basis than at any time in the near future, which is likely the next three to five years.

    Mike Bruzzone, Camp Marketing
