19 Comments
mode_13h - Friday, April 5, 2019 - link
Good riddance, IMO. There are good reasons IEEE 754 used a different bit allocation for its half-precision format.
Given that bfloat16 isn't useful for much beyond deep learning, I think it'd be a good call to drop it.
It's a mistake to burn CPU cycles on deep learning, when GPUs do it far more efficiently. Keep in mind that around the time of Ice Lake SP's introduction, Intel should have its own GPUs. Not to mention Nervana's chips and Altera's FPGAs.
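For reference, the two 16-bit formats being compared split their non-sign bits differently: IEEE 754 half precision uses 5 exponent and 10 fraction bits, while bfloat16 uses 8 exponent and 7 fraction bits, i.e. the top half of an IEEE 754 single. A minimal C sketch of that relationship, using plain truncation rather than the round-to-nearest-even conversion real hardware typically performs:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* bfloat16 is just the upper 16 bits of an IEEE 754 binary32:
   1 sign bit, 8 exponent bits, 7 fraction bits.
   IEEE half precision instead packs 1/5/10.                    */
static uint16_t float_to_bfloat16(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   /* reinterpret the bits safely     */
    return (uint16_t)(bits >> 16);    /* truncate: drop 16 fraction bits */
}

static float bfloat16_to_float(uint16_t b)
{
    uint32_t bits = (uint32_t)b << 16; /* lost fraction bits come back as 0 */
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

int main(void)
{
    float x = 3.14159265f;
    uint16_t b = float_to_bfloat16(x);
    /* Expect roughly 3.140625: only ~2-3 decimal digits survive,
       but the full float32 exponent range is preserved. */
    printf("%.8f -> 0x%04x -> %.8f\n", x, (unsigned)b, bfloat16_to_float(b));
    return 0;
}
```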
Yojimbo - Saturday, April 6, 2019 - link
But Intel is pursuing a strategy of some sort of unified code across their product lines. So their CPUs should be able to run that code, even if it does so more slowly than an accelerator, or else a fracture is created and the strategy looks suspect. There is likely to be a market for machine learning on CPU-only servers beyond 2020, and 1) if bfloat16 can help its performance, it would be good for it to be there, and 2) it would be a problem if that market's codebase were separated from the rest of Intel's constellation by a lack of support for the format.
Of course, I wonder how useful such a unified system will be when dealing with very different underlying architectures. Maybe in a relatively narrow field like deep learning they can make it work, but if things start to require more vertical integration, I would think that getting good performance with any sort of portability would require everything to be heavily re-optimized for each architecture, and even for each mix of architectures. That would seem to defeat the purpose, and the heavily machine-learning-focused applications should presumably be running almost entirely on Nervana hardware anyway, assuming Intel believes in its own accelerator.
mode_13h - Saturday, April 6, 2019 - link
Have they even said much about oneAPI? I doubt it can mean something like bfloat16 everywhere, because that would instantly obsolete all of their existing products. Perhaps they're talking about higher-level interfaces that don't necessarily expose implementation-specific datatypes.
I guess time will tell.
saratoga4 - Saturday, April 6, 2019 - link
> But Intel is pursuing a strategy of some sort of unified code across their product lines. So their CPUs should be able to run that code, even if it does so more slowly than an accelerator,
Intel's compiler has been able to do this for 20 years, at least since SSE, maybe even MMX. You don't need to support all instructions on all devices so long as your development tools do proper runtime checking.
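A minimal sketch of the runtime dispatch being described, assuming GCC or Clang on x86; the two kernel bodies are plain-C stand-ins, and only the selection logic is the point:

```c
#include <stdio.h>

/* Two implementations of the same kernel: a "fast path" that would be
   compiled with newer instructions, and a portable fallback. Here both
   bodies are plain C so the sketch compiles anywhere.                  */
static float dot_fast(const float *a, const float *b, int n)
{
    /* stand-in for an AVX2/AVX-512 build of the same loop */
    float s = 0.0f;
    for (int i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}

static float dot_scalar(const float *a, const float *b, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}

typedef float (*dot_fn)(const float *, const float *, int);

/* Choose an implementation once, at runtime, based on what the CPU
   reports -- the same binary then runs on parts with and without the
   newer instructions.                                                */
static dot_fn select_dot(void)
{
    __builtin_cpu_init();                     /* GCC/Clang, x86 only */
    if (__builtin_cpu_supports("avx2"))
        return dot_fast;
    return dot_scalar;
}

int main(void)
{
    float a[4] = {1, 2, 3, 4}, b[4] = {4, 3, 2, 1};
    dot_fn dot = select_dot();
    printf("dot = %f\n", dot(a, b, 4));       /* 4+6+6+4 = 20 */
    return 0;
}
```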
Yojimbo - Sunday, April 7, 2019 - link
Intel seems to be marketing their platform as a method of portability; they aren't marketing the ability of their compiler to target all their architectures. Something like bfloat16 most likely changes the parameters of what makes an effective neural network. So if Intel markets their CPUs for deep learning training, markets bfloat16 as the preferred data type for that training, and markets their platform for portability between their architectures, how could they leave bfloat16 out of their CPUs? Either the CPUs will get bfloat16, or the situation Intel is presenting to customers isn't as smooth as their marketing has recently suggested.
Yojimbo - Sunday, April 7, 2019 - link
A third possibility is, of course, that I am misinterpreting what Intel is saying...
KurtL - Tuesday, April 9, 2019 - link
It is not a mistake to use a CPU for deep learning even though GPUs may do a better job. For one thing, for occasional deep learning it doesn't make sense to invest in overpriced NVIDIA GPUs. And a CPU has a distinct advantage: its memory may be a lot slower than a GPU's, but it is one big pool that no GPU solution can beat. Offering alternatives also keeps the pressure on the makers of AI accelerators to keep pricing reasonable.
bfloat16 is as useful as the IEEE half-precision format, or maybe even more useful. I know IEEE half precision is supported in some graphics and acceleration frameworks, and I am fully aware that it is a first-class citizen in iPhone hardware. But so far, typical PC applications in the areas where you would expect to see it actually use 32-bit floating point, and in a number of cases even 64-bit. And for scientific computing, IEEE half precision is useless.
The real reason the instructions are not in Ice Lake is that Ice Lake was scheduled to launch much earlier, and its design was probably more or less finished quite a while ago. Cooper Lake is a quick stopgap that adds some deep learning support on top of what Cascade Lake offers and was probably easier to implement, especially if the rumours are true that we won't see high-core-count Ice Lake CPUs for a while due to low yields on 10nm. But I'd expect to see these new instructions on an Ice Lake successor.
mode_13h - Wednesday, April 10, 2019 - link
For the most part, memory speed is vastly more important for deep learning than size.
I think the main reason half precision didn't see much use was a chicken-and-egg problem. Game developers didn't go out of their way to use it, because most GPUs implemented only token support, and GPUs didn't bother implementing proper support since no games used it. But since Broadwell, Vega, and Turing that has completely changed - the big 3 makers of PC GPUs now offer twice the fp16 throughput of their fp32. We've already seen some games start taking advantage of that.
As for fp16 not being useful for scientific computing, you can find papers on how the Nvidia V100's tensor cores (which support only fp16 w/ fp32 accumulate) have been harnessed in certain scientific applications, yielding greater performance than if the same computations were run in the conventional way, on the same chip.
Finally, I had the same suspicion about Ice Lake as you. We knew that Intel had some 10 nm designs that were just waiting on their 10 nm node to enter full production.
To be honest, I don't really mind if bfloat16 sticks around - I just don't want to see IEEE 754 half-precision disappear, as I feel that format is more usable for many applications.
The Hardcard - Friday, April 5, 2019 - link
I know it is still a new tech direction, but I am hoping Anandtech or someone will soon go more in depth on training and inference. Hopefully tools will be developed that give an idea of the relative capabilities of these companies' approaches, from smartphone SoCs, through general CPUs and GPUs, to specialized accelerators.
This seems to be a significant proportion of nearly every company's R&D budget currently. IBM is working with a float8 format that they claim loses virtually no accuracy. I hope there can be more articles about this, from the advantage of floating point over integer, to the remarkably little precision that float8 gets away with. Can it really stand up to FP32?
mode_13h - Friday, April 5, 2019 - link
It seems to me that training benchmarks are harder than inference benchmarks, especially when novel approaches are involved, because techniques like reduced precision usually involve some trade-off between speed and accuracy. So you'd have to define some stopping condition, like a convergence threshold, and then report both how long it took to reach that point and what level of accuracy the resulting model delivered.
Inferencing with reduced precision is a little similar, but at least you don't have the challenge of defining what it is you actually want to test - forward propagation is straightforward enough. You still have to measure both speed *and* accuracy, though.
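As a rough sketch of that benchmarking approach - the train_one_epoch and validation_accuracy functions here are hypothetical stand-ins for whatever framework is being measured, with a fake decaying loss so the example runs on its own:

```c
#include <stdio.h>
#include <time.h>

/* Toy stand-ins for the framework under test: a loss that decays each
   "epoch" and a fixed final accuracy. In a real benchmark these would
   be calls into the training code being measured.                     */
static double fake_loss = 2.0;
static double train_one_epoch(void)     { fake_loss *= 0.9; return fake_loss; }
static double validation_accuracy(void) { return 0.912; }

int main(void)
{
    const double loss_delta_threshold = 1e-3;   /* convergence criterion */
    const int    max_epochs = 200;

    double prev_loss = 1e30;
    int epoch = 0;
    clock_t start = clock();

    /* Train until the loss stops improving meaningfully, then report
       BOTH time-to-convergence and the accuracy the model reached -
       either number alone is meaningless for reduced-precision runs.  */
    while (epoch < max_epochs) {
        double loss = train_one_epoch();
        epoch++;
        if (prev_loss - loss < loss_delta_threshold)
            break;
        prev_loss = loss;
    }

    double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
    printf("epochs=%d  time=%.2fs  accuracy=%.3f\n",
           epoch, seconds, validation_accuracy());
    return 0;
}
```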
I do see value in dissecting different companies' approaches, not unlike Anandtech's excellent coverage of Nvidia's Tensor cores. That said, there's not really much depth to some of the approaches, to the extent they simply boil down to a slight variation on existing FP formats.
int8 and int4 coverage could be more interesting, since you could get into which layer types they're most applicable to and the trade-offs of using them (i.e. not just accuracy, but maybe you've also got to use larger layers), plus the typical speeds that result. But that has more to do with how various software frameworks utilize the underlying hardware capabilities than with the hardware itself.
As for IBM's FP8, this describes how they had to reformulate or refactor aspects of the training problem in order to use significantly fewer than 16 bits for training:
https://www.ibm.com/blogs/research/2018/12/8-bit-p...
On a related note, Nervana (now part of Intel) introduced what they call "flexpoint":
https://www.intel.ai/flexpoint-numerical-innovatio...
Yojimbo - Saturday, April 6, 2019 - link
I just read about flexpoint, and it seems the idea is to do 16-bit integer operations on the accelerator while letting those integers express values in ranges defined by a 5-bit exponent held on the host device and (implicitly) shared by the various accelerator processing elements. Thus they get very energy- and die-space-efficient processing elements, but with the flexibility to handle a larger dynamic range than an integer unit alone can provide.
That's a very different way of dealing with the problem than bfloat16, which reduces the significand from 10 bits in half precision to 7 bits in order to grow the exponent from 5 bits to 8 and provide more dynamic range. I wonder how easily code written to create an accurate network on an architecture using the one data type could be ported to an architecture on the other system. It sounds problematic to me.
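A rough sketch of the shared-exponent idea as described above - not Nervana's actual flexpoint implementation, just block floating point in miniature, with int16 mantissas and one exponent per block chosen from the largest magnitude:

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>

#define N 4

/* Encode a block of floats as int16 mantissas plus ONE shared exponent,
   chosen so the largest value fits in the int16 range. The accelerator
   would then work purely on the int16 data; the host tracks *shared_exp.
   (A real implementation would also guard rounding at the very top of
   the range.)                                                           */
static void block_encode(const float *in, int16_t *out, int *shared_exp)
{
    float max_abs = 0.0f;
    for (int i = 0; i < N; i++)
        if (fabsf(in[i]) > max_abs) max_abs = fabsf(in[i]);

    int e;
    frexpf(max_abs, &e);          /* max_abs == m * 2^e, 0.5 <= m < 1 */
    *shared_exp = e - 15;         /* largest magnitude lands near 2^15 */

    for (int i = 0; i < N; i++)
        out[i] = (int16_t)lrintf(ldexpf(in[i], -*shared_exp));
}

static void block_decode(const int16_t *in, int shared_exp, float *out)
{
    for (int i = 0; i < N; i++)
        out[i] = ldexpf((float)in[i], shared_exp);
}

int main(void)
{
    float x[N] = {0.001f, -3.5f, 700.0f, 0.25f}, y[N];
    int16_t q[N];
    int e;

    block_encode(x, q, &e);
    block_decode(q, e, y);
    /* Values far smaller than the block maximum lose precision - the
       trade-off the shared exponent imposes. */
    for (int i = 0; i < N; i++)
        printf("%g -> %d * 2^%d = %g\n", x[i], q[i], e, y[i]);
    return 0;
}
```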
mode_13h - Wednesday, April 10, 2019 - link
> I wonder how easily code written to create an accurate network on an architecture using the one data type could be ported to an architecture on the other system. It sounds problematic to me.
You realize that people are inferencing with int8 (and smaller) models trained with fp16, right? Nvidia started this trend with their Pascal GP102+ chips and doubled down on it with Turing's Tensor cores (which, unlike Volta's, also support int8 and I think int4). AMD jumped on the int8/int4 bandwagon with Vega 20 (i.e. 7nm).
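For the kind of conversion being described - a model trained in floating point, then run in int8 - a minimal sketch of symmetric post-training quantization with a single per-tensor scale (real toolchains add per-channel scales, calibration data, and zero points):

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>

#define N 5

/* Map float weights into int8 with a single symmetric scale:
   q = round(w / scale), where scale is chosen so the largest
   |w| lands at 127. Inference then works on int8 tensors and
   only the scale survives as floating-point metadata.         */
static float quantize_int8(const float *w, int8_t *q)
{
    float max_abs = 0.0f;
    for (int i = 0; i < N; i++)
        if (fabsf(w[i]) > max_abs) max_abs = fabsf(w[i]);
    if (max_abs == 0.0f) max_abs = 1.0f;  /* avoid a zero scale */

    float scale = max_abs / 127.0f;
    for (int i = 0; i < N; i++)
        q[i] = (int8_t)lrintf(w[i] / scale);
    return scale;
}

int main(void)
{
    float w[N] = {0.02f, -0.76f, 0.31f, 1.27f, -1.05f};
    int8_t q[N];
    float scale = quantize_int8(w, q);

    for (int i = 0; i < N; i++)
        printf("%+.3f -> %4d (dequantized %+.3f)\n",
               w[i], q[i], q[i] * scale);
    return 0;
}
```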
Yojimbo - Wednesday, April 10, 2019 - link
I'm not sure what you mean. I am talking about code portability, not about compiling trained networks. bfloat16 and flexpoint are both floating-point implementations and therefore, as you seem to agree, targeted primarily at training rather than inference.
Santoval - Saturday, April 6, 2019 - link
It is remarkable that IBM is doing training at FP8, even if they had to modify their training algorithms. I thought such low precision was possible only for inference. Apparently, since inference is already at INT4, training had to follow suit. In the blog you linked they say they intend to go even further in the future, with inference at... 2-bit (?!) and training at an unspecified <8-bit (6-bit?).
Bulat Ziganshin - Sunday, April 7, 2019 - link
Humans perform inference at single-bit precision (like/hate) and became the most successful species. Maybe IBM just revealed this secret? :D
mode_13h - Wednesday, April 10, 2019 - link
It's funny you mention that. Spiking neural networks are designed to model natural ones more closely.
One-bit neural networks are an ongoing area of research. I really haven't followed them, but they do seem to come up now and then.
Elstar - Monday, April 8, 2019 - link
I'd wager that Intel is eager to add BF16 support, but they weren't *planning* on it originally. Therefore, tweaking an existing, proven design to add BF16 support in the short term is easier than adding it to Ice Lake, which ships later but needs significantly more qualification testing because it is a huge redesign.
mode_13h - Wednesday, April 10, 2019 - link
Probably right.
JayNor - Saturday, June 29, 2019 - link
I believe Intel has recently dropped their original flexpoint idea from the NNP-L1000 training chip, replacing it with bfloat16. It is stated in the fuse.wikichip article from April 14, 2019.