shabby - Tuesday, November 6, 2018 - link
3rd slide shows Nfinity Fabric? Is that a typo or is amd trolling nvidia?
Samus - Thursday, November 8, 2018 - link
LOL good catch, and hilarious!
FredWebsters - Wednesday, December 26, 2018 - link
I think that it is a good catch too!
olafgarten - Tuesday, November 6, 2018 - link
It will be interesting to see if AMD can gain any traction in HPC considering how CUDA is deeply ingrained in most applications.
Yojimbo - Tuesday, November 6, 2018 - link
If AMD gains any traction, it won't be with this product. Even setting aside the CUDA moat, this AMD card will only be "available" in Q4; in fact, they won't have any significant volume with it then. But regardless, they are targeting data centers, and since AMD doesn't have wide-scale deployment of their GPU tech in data centers now, there would be a lengthy verification process before full deployment. Once that happened, and it would probably take at least 6 months, AMD would, again setting aside NVIDIA's superior software library support and any application code users may already have optimized for NVIDIA's hardware, perform roughly equal to a card that NVIDIA would have had in general availability for over a year and a half. Roughly equal, that is, except for AI training workloads, where NVIDIA's Tensor Cores would give NVIDIA an advantage.
Furthermore, by the time all that happens NVIDIA will be close to releasing their next-generation data center compute card on 7 nm, which could arrive in late 2019 or in 2020 in terms of availability (they may announce something at GTC San Jose in March, or whenever it is held, but it wouldn't have real availability until months later, much like this AMD MI60 card). NVIDIA, already having their GPUs in data centers, can get their products verified much faster. This MI60 card might end up having to go toe-to-toe with NVIDIA's 7 nm card, in which case there will be no contest, both in hardware capabilities and software support.
ABR - Wednesday, November 7, 2018 - link
Could AMD just implement CUDA, or are there copyright issues there?
Bulat Ziganshin - Wednesday, November 7, 2018 - link
CUDA provides access to specifics of the GeForce architecture; e.g., in a lot of places it depends on 32-wide warps. OpenCL tries to hide GPU architecture differences, so it's more universal.
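A minimal sketch of the warp-width dependence described above, assuming a standard CUDA toolkit; the kernel, names, and sizes are illustrative rather than taken from any real project. The hard-coded 32-lane width is exactly the sort of architecture detail that doesn't carry over to AMD's 64-lane GCN wavefronts:

```cpp
// Illustrative only: a warp-level sum reduction that hard-codes the 32-lane
// warp width of NVIDIA GPUs. AMD's GCN wavefronts are 64 lanes wide, so this
// exact pattern would need rework to map well onto AMD hardware.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void warp_sum(const float* in, float* out) {
    float v = in[threadIdx.x];
    // Tree reduction across the 32 lanes of a single warp.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    if (threadIdx.x == 0) *out = v;  // lane 0 ends up with the warp's total
}

int main() {
    float h_in[32], h_out = 0.0f;
    for (int i = 0; i < 32; ++i) h_in[i] = 1.0f;  // expected sum: 32
    float *d_in, *d_out;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    warp_sum<<<1, 32>>>(d_in, d_out);  // one block == exactly one warp
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %.1f\n", h_out);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```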
zangheiv - Wednesday, November 7, 2018 - link
Wrong. And I beg to differ:
1) Nvidia's Volta V100 has similar FP32 and FP16 and AI deep learning performance but Volta is 21B transistors compared to 13B on Vega. Obviously Volta v100 cannot die-shrink to 7nm easily and feasibly.
2) MI60 is for training and performs slightly better than Volta. It goes head-to-head with Volta in training use-cases. Inference is where tensor cores take over.
3) CUDA with AI is not difficult. We're not talking about a game optimization engine and all that fancy geometry and architecture-dependent draw calls etc. It's AI training and it's quite straightforward if you have all the math libraries, which currently exist with OpenCL and ROCm.
4) CUDA moat you talk about will be the rope around Nvidia's neck. Even Jensen knows that open source is the future. Intel will also use OpenCL; Xilinx, which currently holds the world record in inference, uses OpenCL. Google uses OpenCL. macOS, Microsoft, literally everyone. Android is already Vulkan.
5) Currently AMD doesn't need tensor-cores, Xilinx already has that covered. MI60 and Xilinx solutions are way more cost-effective. Not because margins are lower but because 21B monolith V100 is super expensive to produce.
6) MI60 will most certainly gain traction. I'm certain AMD knows more about their customers and the market than you do.
7) Rome and MI60 will use HSA. This is AMD specific and requires proprietary logic within the CPU and GPU. For large-scale simulation use-cases AMD has a definitive advantage with that.
8) You forgot hardware virtualization. This is unique to AMD's solution.
9) ECC memory
Point is, there's A LOT of things MI60 does better. And architecturally Vega20 is clearly superior in the sense of a smaller footprint, better efficiency and better yield.
Yojimbo - Thursday, November 8, 2018 - link
"1) Nvidia's Volta V100 has similar FP32 and FP16 and AI deep learning performance but Volta is 21B transistors compared to 13B on Vega. Obviously Volta v100 cannot die-shrink to 7nm easily and feasibly."It's on 7 nm. It can run at faster clocks while using less power. I'm also curious, did they give us performance per watt comparisons? NVIDIA could shrink Volta, but why would they? NVIDIA will introduce a new architecture on 7 nm. As far as the 21B transistors, that is including 3 NVLink controllers.
"2) MI60 is for training and performs slightly better than Volta. It goes head-to-head with Volta in training use-cases. Inference is where tensor cores take over."
We don't have benchmarks that show that. You think AMD's selected benchmarks mean anything for practical real-world training runs? And even AMD's benchmarks aren't making such a claim, since they are not using the Tensor Cores for the V100. V100 Tensor Cores are for training. This shows you don't know what you are talking about.
"3) CUDA with AI is not difficult. We're not talking about a game optimization engine and all that fancy geometry and architecture-dependant draw-calls etc. It's AI training and it's quite straight-forward if you have all the math libraries which currently exist with OpenCL and ROCm."
Sure, that's what Raja Koduri said years ago with the MI25 release, and what have we seen since then? It still takes lots of architecture optimization in software libraries to make robust tools. Last I knew, cuDNN and cuBLAS were well ahead of anything available for use on AMD architecture. Being "straightforward" is not the issue. The issue is performance.
"4) CUDA moat you talk about will be the rope around Nvidia's neck"
Uh huh. As if NVIDIA doesn't already have OpenCL tools that perform better on NVIDIA's hardware than AMD's OpenCL tools do on AMD's hardware...
"5) Currently AMD doesn't need tensor-cores, Xilinx already has that covered. MI60 and Xilinx solutions are way more cost-effective. Not because margins are lower but because 21B monolith V100 is super expensive to produce."
"6) MI60 will most certainly gain traction. I'm certain AMD knows more about their customers and the market than you do."
You simply stating it to be so, without giving valid reasons, doesn't convince me somehow. You can assume that because AMD paper-launches a card that card will be successful, if you want. Just don't read AMD's history and you'll be fine.
"7) Rome and MI60 will use HSA. This is AMD specific and requires proprietary logic within the CPU and GPU. For large-scale simulation use-cases AMD has a definitive advantage with that."
"8)You forgot hardware virtualization. This is unique to AMD's solution."
AMD has had hardware virtualization for a while and has not gained much market share in the data center virtualization market thus far.
"9)ECC memory"
What about it? Both the V100 and the MI60 have it.
"Point is, there's A LOT of things MI60 does better."
Hardware virtualization is a lot? What else have you mentioned?
"And architecturally Vega20 is clearly superior in the sense of a smaller footprint, better efficiency and better yield."
Vega20 is architecturally better because it uses a full node advance and an extra 1 1/2 years to market to match the (theoretical) performance of the V100? That's a strange conclusion. Surely you can see your bias.
Again, Tensor Cores are useful for training. And you might want to inform the boards of AMD and Xilinx that they are going to be sharing resources and profits from now on... FPGAs in inference are more unproven than even GPUs in inference, btw.
At the same price point, I think a 7 nm GV100 would be preferable for most use cases to the MI60. I doubt AMD has the software stack necessary to make GPU inference that attractive for most workloads. NVIDIA has put a lot of work into neural network optimization compilers, container support, Kubernetes support, etc.
The MI60 will never see widespread use. Widespread meaning, say, 5% of the market share for GPU compute. It will be used on a small scale for evaluation purposes, and perhaps by people for whom hardware virtualization is important but who still want to put their data in the cloud (which is inherently less secure and more nerve-racking, anyway). It remains to be seen whether the MI60 is a product that allows AMD to begin to get their foot in the door or if it is just another product that will be forgotten by history.
Yojimbo - Thursday, November 8, 2018 - link
Oh, I missed #5:
I'm not sure why you think AMD and Xilinx are a team. FPGAs have certainly not captured the inference market, anyway. But, again, Tensor Cores are for training, not just inference. AMD is developing their own Tensor Core-like technology, and I guess when they come out with it you will decide it has suddenly become something necessary for them, and when you squint and cock your head at a steep enough angle it will almost look like their solution is better. Don't worry about the cost to make the V100. They can charge a lot more for it than AMD can charge for the MI60 (when it is actually available) because the demand for it is a lot higher.
Yojimbo - Saturday, November 10, 2018 - link
By the way, here is ResNet-50 training using the latest NVIDIA toolchain and the Tensor Cores on their GPUs:
http://3s81si1s5ygj3mzby34dq6qf-wpengine.netdna-ss...
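For context on what "using the Tensor Cores" amounts to at the library level: frameworks ultimately issue FP16 matrix multiplies with FP32 accumulation through cuBLAS. A rough sketch, assuming a Volta-class GPU and a CUDA 9/10-era toolkit; the wrapper function, shapes, and omitted error handling are illustrative:

```cpp
// Illustrative sketch: routing a single GEMM to Tensor Cores via cuBLAS.
// A, B are FP16 device pointers; C is an FP32 device pointer (column-major).
#include <cublas_v2.h>
#include <cuda_fp16.h>

void tensor_core_gemm(const __half* A, const __half* B, float* C,
                      int m, int n, int k) {
    cublasHandle_t handle;
    cublasCreate(&handle);

    // Opt in to Tensor Core math (Volta or newer; dimensions should be
    // multiples of 8 for the fast path to actually be taken).
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);

    const float alpha = 1.0f, beta = 0.0f;

    // FP16 inputs with FP32 accumulation -- the mixed-precision pattern
    // used for training. No transposes, for simplicity.
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 m, n, k,
                 &alpha,
                 A, CUDA_R_16F, m,   // lda = m
                 B, CUDA_R_16F, k,   // ldb = k
                 &beta,
                 C, CUDA_R_32F, m,   // ldc = m
                 CUDA_R_32F,         // accumulate in FP32
                 CUBLAS_GEMM_DEFAULT_TENSOR_OP);

    cublasDestroy(handle);
}
```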
wingless - Tuesday, November 6, 2018 - link
Infinity Fabric has more bandwidth than NVLink for the GPU-to-GPU connections. That's an interesting tidbit. I hope this quickly trickles down into consumer cards like NVLink did with the Nvidia RTX line. I want to see some shared-memory Crossfire configurations ASAP.
pSz - Tuesday, November 6, 2018 - link
That's not correct. On the V100 NVIDIA has up to 3x 50 GB/s bi-directional links, so that's a 150/150 GB/s up/down link (which they refer to as 300 GB/s) in the SXM2 packaging, see [1]. The PCIe boards AFAIK do not have NVLink connectors; the Quadro GV100s, however, do have two per board [2], hence can communicate with 100/100 GB/s max.
It will be interesting to see how flexible the topologies AMD will allow are, and how those connectors will be implemented in dense configurations where, as in many applications, you can't afford 5U of rack space just to place cards vertically next to each other for the rigid connectors that they picture.
[1] https://www.nvidia.com/en-us/design-visualization/...
[2] https://www.nvidia.com/en-us/design-visualization/...
p1esk - Tuesday, November 6, 2018 - link
2080 Ti has NVLink (100 GB/s).
Yojimbo - Tuesday, November 6, 2018 - link
The 2080 Ti only has 2 NVLink connectors. But why would you compare the MI60 to the 2080 Ti rather than the Tesla V100?
skavi - Wednesday, November 7, 2018 - link
How reasonable is a chiplet design for GPUs? I can speculate on this myself, so please only answer if you have real insight.
abufrejoval - Wednesday, November 7, 2018 - link
I guess it's important to remember that this is a cloud design and meant to be shared, for scale-in and scale-out.
For scale-out, it won't be used as a gaming accelerator but as an HPC accelerator, where no single GPU would ever be big enough anyway. Having fixed (or at least power-of-2) sizes makes it less of an effort to tune (and schedule) the HPC applications across the racks.
For scale-in, that is, VMs or containers with fractional parts of these chiplets: it makes a lot of sense to hit something that gives you good FHD performance and then profit from the lower cost and higher yields of the smaller chips.
I would think that yields on EUV are much better with smaller chips, and AMD is teaching Intel and Nvidia a lesson: bigger chips are not necessarily the smarter solution, and you need to connect them properly. Lisa makes it personal :-)
abufrejoval - Wednesday, November 7, 2018 - link
Failed to mention: scale-in is where I see them hitting the new cloud gaming target.
haukionkannel - Wednesday, November 7, 2018 - link
Maybe in the future some CPU + IO + GPU combination... and even then only maybe. The separate IO part reduces speed somewhat compared to a monolithic design.
Multi-GPU has one big problem: operating systems do not support multi-GPU systems directly. You have to write separate driver support for each application, like with SLI and Crossfire. OSes have directly supported multiple CPUs for a long time, so they just work, because the OS distributes the tasks to the different CPUs. We need similar support in the OS so that it can directly use multiple GPUs without manual handling.
edzieba - Wednesday, November 7, 2018 - link
Somewhat. OSes have support for multiple homogeneous cores with the same access to IO, memory, etc. Support for multiple heterogeneous CPUs (e.g. multiple sockets, CPUs with internal NUMA domains) is in the same "technically present, but requires explicit implementation to actually work with any utility" category as multiple GPUs.
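A small sketch of what "explicit implementation" looks like in practice for multiple GPUs, assuming the CUDA runtime API; the kernel and the even split of work are illustrative. None of this happens automatically the way CPU scheduling does:

```cpp
// Illustrative sketch: the application, not the OS, enumerates GPUs and
// partitions the work across them by hand.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    int device_count = 0;
    cudaGetDeviceCount(&device_count);

    const int n_per_gpu = 1 << 20;  // each GPU gets its own slice of the problem
    for (int dev = 0; dev < device_count; ++dev) {
        cudaSetDevice(dev);  // every allocation/launch below targets this GPU
        float* d_chunk = nullptr;
        cudaMalloc(&d_chunk, n_per_gpu * sizeof(float));
        cudaMemset(d_chunk, 0, n_per_gpu * sizeof(float));
        scale<<<(n_per_gpu + 255) / 256, 256>>>(d_chunk, n_per_gpu, 2.0f);
        cudaDeviceSynchronize();  // serializes the GPUs; real code would use
                                  // streams or host threads to overlap them
        cudaFree(d_chunk);
    }
    printf("ran work on %d GPU(s)\n", device_count);
    return 0;
}
```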
abufrejoval - Wednesday, November 7, 2018 - link
Neither Rome nor Instinct addresses consumer gaming PCs.
These designs are for HPC and cloud, and within that segment perhaps DC-hosted gaming for massive multiplayer setups, where the multiplayer world data synchronization is a bigger issue than sending FHD-resolution encoded GPU renders at acceptable latencies and bandwidth.
And all OSes I know have a hard time distinguishing a GPU from a printer; I'm not sure I'd want Windows to try handling scheduling on ~10,000 (GPU) cores.
MrSpadge - Wednesday, November 7, 2018 - link
It costs power for longer data transmission lines, but helps with yields and scalability (if done right). So yes, feasible I would say.
Xajel - Wednesday, November 7, 2018 - link
For AI, ML, compute, data centres, etc., it's very practical.
But for games, not so much.
Yojimbo - Wednesday, November 7, 2018 - link
Both NVIDIA and AMD have been considering and developing multi-chip module GPUs for some time. Here is a paper published by NVIDIA in 2017: https://research.nvidia.com/publication/2017-06_MC...
zangheiv - Wednesday, November 7, 2018 - link
You need to speculate first to start the discussion.
The Hardcard - Wednesday, November 7, 2018 - link
In June, Anton made a rough estimate of the die size from a photo at a demo. He estimated 336 mm². The actual size is 331 mm².
That's carny-level skills.