Bulat Ziganshin - Saturday, May 26, 2018 - link
Obviously, 8-bit integer and 16/32-bit FP operations cannot be directly compared.

mode_13h - Saturday, May 26, 2018 - link
Yeah, that was pretty weak, Ian. You should've made two rows: training TFLOPS and inferencing TOPS. The 64 (or 83.2) TFLOPS of half-precision performance are clearly meant to address the same purposes as the V100's fp16 tensor cores.

And do we know that the V100 lacks the 8-bit integer dot product found in most of their Pascal GPUs?
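For reference, a minimal numpy sketch of what that 8-bit dot product instruction (DP4A) computes: a 4-element int8 dot product accumulated into a 32-bit integer. The input values here are made up.

```python
import numpy as np

# Emulate DP4A: a 4-element int8 dot product with an int32 accumulator.
a = np.array([100, -128, 37, 5], dtype=np.int8)
b = np.array([64, 127, -90, 11], dtype=np.int8)
acc = np.int32(1000)  # running 32-bit accumulator

# Widen to int32 before multiplying so the int8 products don't wrap.
result = acc + np.dot(a.astype(np.int32), b.astype(np.int32))
print(result)  # -12131; individual products fit easily in int32
```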
Yojimbo - Sunday, May 27, 2018 - link
The V100 has the 8-bit integer operations. But the theoretical peak for the operation is 60 TOPS on the V100, less than the Tensor Core peak of 120 TOPS. So there's probably no reason to use it, as the Tensor Cores give better precision and faster execution.

Of course, theoretical comparisons of such different chips are not very useful. We need real application benchmarks. But yeah, the comparisons in the chart are very wrong.
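The back-of-the-envelope math, using approximate public V100 (SXM2) specs. Exact boost clocks vary by SKU, so treat these as ballpark figures:

```python
# Approximate V100 (SXM2) peak throughput from public specs.
cuda_cores = 5120
tensor_cores = 640
boost_clock_hz = 1.53e9

fp32_tflops = cuda_cores * 2 * boost_clock_hz / 1e12        # FMA = 2 ops/clock
int8_tops = fp32_tflops * 4                                 # DP4A runs at 4x the fp32 rate
tc_tflops = tensor_cores * 64 * 2 * boost_clock_hz / 1e12   # 64 FMAs/clock per Tensor Core

print(f"fp32:   {fp32_tflops:.0f} TFLOPS")   # ~16
print(f"int8:   {int8_tops:.0f} TOPS")       # ~63 (commonly quoted as ~60)
print(f"tensor: {tc_tflops:.0f} TFLOPS")     # ~125 (often quoted as 120)
```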
Firstly, NVIDIA's half-precision FLOPS are general-purpose FMAs. I get the idea that this ASIC's half-precision FLOPS are not, but are rather more akin to the Tensor Core FLOPS on the V100. The reason I say this is that the chip only has 102.4 GB/s of memory bandwidth, so all those FLOPS would be useless in most general-purpose applications. They need specialized algorithms able to reuse data for high compute density to have any hope of taking advantage of those FLOPS with that bandwidth.
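To put a number on that, dividing the chart's peak FLOPS by its memory bandwidth gives the arithmetic intensity a kernel would need just to stay compute-bound:

```python
# Arithmetic intensity required to saturate the Cambricon chip's fp16
# units from DRAM alone, using the figures in the article's chart.
peak_flops = 64e12    # 64 TFLOPS half precision
mem_bw = 102.4e9      # 102.4 GB/s

print(peak_flops / mem_bw)  # 625 FLOPs per byte streamed from memory

# A streaming kernel like y = a*x + y does ~2 FLOPs per 6 bytes of fp16
# traffic (~0.33 FLOPs/byte), so only algorithms with massive on-chip
# reuse (e.g. large matrix multiplies) can get anywhere near peak.
```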
Secondly, 8-bit TOPS should not generically be compared to NVIDIA's Tensor Core FLOPS under "deep learning". 8-bit quantization cannot be used for training, and even in inference it is only sometimes successful. The Tensor Cores can be used for more inference applications than 8-bit integer can, and they can be used for training as well.
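As a minimal sketch of what int8 inference requires up front (symmetric per-tensor quantization, with made-up weights), whether a real network tolerates the rounding is exactly the catch:

```python
import numpy as np

# Quantize trained fp32 weights to int8 (symmetric, per-tensor).
w = np.random.randn(4, 4).astype(np.float32)
scale = np.abs(w).max() / 127.0                   # map max |w| onto int8 range
w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

# The round trip is lossy; training can't generally survive this,
# and some inference workloads can't either.
err = np.abs(w - w_q.astype(np.float32) * scale).max()
print(err)
```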
Thirdly, NVIDIA's Tensor Core implementation is a 16-bit multiply with a 32-bit accumulate, which is superior to pure 16-bit arithmetic, and this difference has been shown in research to be important.
So, as best I can tell, the proper comparison is the V100's Tensor Core "Deep Learning" numbers against the Cambricon chip's "Half Precision" numbers, with the caveat that the Tensor Cores potentially provide better accuracy because of the 32-bit accumulate. The V100's Tensor Core numbers can also be compared with the Cambricon chip's 8-bit numbers for inference applications, but it should be noted that mixed-precision floating point is being compared to 8-bit integer in that case.
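A quick way to see why the 32-bit accumulate matters, with made-up numbers: sum many small fp16 values in a pure fp16 accumulator versus an fp32 one.

```python
import numpy as np

vals = np.full(20000, 0.001, dtype=np.float16)  # 20000 x 0.001

acc16 = np.float16(0.0)
for v in vals:            # pure fp16 accumulation
    acc16 = acc16 + v

acc32 = vals.astype(np.float32).sum()  # fp16 inputs, fp32 accumulation

print(acc16, acc32)  # fp16 gets stuck around 4.0; fp32 gives ~20
```

Once the running fp16 sum reaches 4.0, each 0.001 addend is smaller than half a ulp and rounds away entirely, which is exactly the failure mode the 32-bit accumulate avoids.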
steve_musk - Sunday, May 27, 2018 - link
The other elephant in the room for both these chips when discussing FLOPS/OPS is the memory bandwidth needed to feed the execution units. For the V100 Tensor Cores, you are doing really well if you can get even 40-50% of theoretical FLOPS (I'm a CUDA dev and have talked with some Nvidia guys), because you can't get enough data in from memory to feed the cores. Even the Nvidia gurus who write "assembly" code for cuDNN only get near peak performance in very specific and limited circumstances, which involve loading a Tensor Core with data and then reusing it 10+ times.
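Rough roofline math with round published numbers (~125 TFLOPS Tensor Core peak, ~900 GB/s HBM2) backs that up:

```python
# Balance point for the V100's Tensor Cores: FLOPs needed per byte
# read from HBM2 just to avoid being memory-bound.
peak_flops = 125e12
mem_bw = 900e9

print(peak_flops / mem_bw)  # ~139 FLOPs per byte

# With 2-byte fp16 operands that's ~280 FLOPs per element loaded, so
# data has to be reused across every level of the hierarchy (HBM to
# L2 to shared memory to Tensor Core registers), which is consistent
# with seeing only 40-50% of peak in practice.
```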
Yojimbo - Sunday, May 27, 2018 - link

That's always the case, though. That's why you have to test hardware on real applications. Here the difference in bandwidth is so great that I would guess results would vary wildly depending on the test. But people still compare theoretical specs of products.

mode_13h - Tuesday, May 29, 2018 - link
That's why people use cuDNN and batching.

Bizwacky - Wednesday, May 30, 2018 - link
Thanks for this comment; it really helps clear up the comparison between the two. I think if these can hit 30% of the performance of the Nvidia cards, they still might have great price/performance, if Huawei can manage to sell them profitably at ~10% of the price.
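In rough numbers (these ratios are hypothetical, not measured):

```python
relative_perf = 0.30    # 30% of the NVIDIA card's throughput
relative_price = 0.10   # ~10% of the NVIDIA card's price

print(relative_perf / relative_price)  # 3.0x the performance per dollar
```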
Pork@III - Saturday, May 26, 2018 - link

Poor Nvidia, poor Tesla, poor green fens

bananaforscale - Sunday, May 27, 2018 - link
Fens?

Pork@III - Sunday, May 27, 2018 - link
Ok, zombies :)

Dragonstongue - Saturday, May 26, 2018 - link
That's insane, but it just goes to show what's possible with a focused design instead of trying to pick and choose what it will or won't support. In your face, ngreedia. That being said, I can only imagine what kind of insane design AMD could come up with when/if it says "f it, make an all-out compute-specific accelerator", because as it stands they are guilty of trying to "do it all" as well.

With AMD I can "kind of" understand, because they have many frying pans in many fires. Nvidia has nowhere near the same excuse, because they have ALWAYS (to my knowledge) been focused on graphics and graphics alone (aside from a brief stint with motherboard chipsets, where they did as little as possible and overcharged for what was given: a load of issues, overheating, etc.).
Anyways... that is some serious horsepower on these, considering the very low memory bandwidth and TDP. I so wish they would get rid of that stupid designation, because they can easily blow past that number or be under it 9 times out of 10. They need a "watt limited" number or something like that instead ^.^
mode_13h - Tuesday, May 29, 2018 - link
You might be reading too much into the headline numbers. There are a couple of key phrases, like:

"All this data relies on sparse data modes being enabled."
Which tells me their headline numbers might not be sustainable, in the general case, and perhaps you need exactly the right network architecture to get close to the theoretical compute numbers.
It does make one wonder whether Nvidia will try excluding things like fp64 and geometry engines, to make a truly purpose-built deep learning chip.
bryanlyon - Saturday, May 26, 2018 - link
I think it's important to note that the Volta line (including the V100) actually has specialized AI Tensor Cores, which mean the AI speed is approximately 120 TFLOPS, not the 30 TFLOPS for generic CUDA.

ToTTenTranz - Sunday, May 27, 2018 - link
Regarding the chart, I wouldn't say Cambricon's solution uses "VRAM".

Valantar - Sunday, May 27, 2018 - link
Came here to say the same thing. VRAM = Video RAM, right? There's nothing video related here (even if the chip were used for facial recognition or other video-based work), so calling it "compute RAM" or something like that would seem more fitting, no?

Yojimbo - Sunday, May 27, 2018 - link
Not the first time a term has become bastardized, and it won't be the last. It's not being used for anything video-related in the V100, either, in the contexts in which the comparisons are being made.

The important thing is that it is on-board RAM and not system RAM.
bananaforscale - Sunday, May 27, 2018 - link
"Memory clock"... Yeah, 1600 is strictly correct (DDR4 isn't, that's a technology), what it means tho is DDR4-3200 memory. That table could've been so much better if a bit more thought had gone into naming the rows.wr3zzz - Monday, May 28, 2018 - link
wr3zzz - Monday, May 28, 2018 - link

How big a hurdle is it to go from making something like this to a GPU?

vladx - Monday, May 28, 2018 - link
Too high to be worth entering a very competitive market like the GPU market as a new player today.

mode_13h - Tuesday, May 29, 2018 - link
Have you heard of a company called Vivante?

vladx - Tuesday, May 29, 2018 - link
That's why I mentioned "as a new player"; Vivante has been in the GPU business since 2007.

Threska - Monday, May 28, 2018 - link
I find articles like this interesting in the context of articles like the one linked below. Apparently China CAN do semiconductors.

https://www.bloomberg.com/view/articles/2018-04-29...
mode_13h - Tuesday, May 29, 2018 - link
Yeah, that article sucks. There's no mention of the huge AMD IP licensing deal, nor of Sunway TaihuLight with its custom vector processors.

There's also no consideration given to the fact that major US semiconductor companies (including both Nvidia and AMD) have had design centers there for more than a decade, which have surely helped to train up the workforce there.
Vector processors and deep learning chips aren't exactly the hardest things to design, but I think China's semiconductor industry has been on a lot of people's radar screens for a while now.
peevee - Wednesday, May 30, 2018 - link
"Cambricon Technologies, the company in collaboration with HiSilicon / Huawei for licensing specialist AI silicon intellectual property"Is it Google Translate from Chinese?