16 Comments
firewrath9 - Sunday, November 17, 2019 - link
Hopefully Intel makes good drivers. Perhaps with Intel's massive budget they could make a competitor to CUDA; perhaps it could even be open source. Long shot though.
Yojimbo - Monday, November 18, 2019 - link
I don't think it matters if it is open source or not. There are plenty of cross-compilers for CUDA code. The issue is that you can't just take CUDA code optimized for NVIDIA's hardware, run it through a compiler that targets some other hardware, and expect it to run well. You still have to rewrite a lot of the code to fit the other architecture, and CUDA wasn't designed to support other architectures. Now, if Intel did make their API open source, they still aren't going to add support for rival architectures, and even if they or someone else did, it would still take hand optimization to effectively target those other architectures, so the issue is much the same.
There was the same issue with Xeon Phi. Intel promised that you could take existing x86 code and just compile it on a Xeon Phi. That was true, but it ran horribly without hand optimization that took about as much effort as getting something to run well on a GPU. What has to be understood is that the accelerated portions of the code are a small part of the total code base, but they account for the vast majority of executed operations. For acceleration to be effective, they must be optimized carefully. Very large legacy code bases, such as exist in the academic world, would take a lot of effort to parallelize using CUDA or OpenCL, etc., so they get a good amount (but not all) of the benefit by using compiler directives through OpenMP or OpenACC. But that is neither here nor there as far as CUDA or Intel's forthcoming GPU API are concerned, because directive-based code is already fully portable, from NVIDIA's hardware or anyone else's, to any hardware that has a compiler written for it. It's orders of magnitude less work to write such a compiler, and to focus some work on the core math libraries it uses, than to implement anything in a to-the-metal API like CUDA. For the foreseeable future, I would bet almost all code run on Intel's and AMD's compute GPUs will either 1) run through OpenACC and OpenMP compiler directives rather than a low-level API, or 2) be focused on certain narrow use cases, mainly deep learning frameworks like TensorFlow. NVIDIA has spent billions of dollars over the years to develop its trove of CUDA libraries, all optimized for its architecture. It will take years for Intel to have a chance to catch up, and AMD probably doesn't have the money to do so.
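To make the directive approach concrete, here is a minimal sketch of an OpenMP target-offload loop (the saxpy routine is purely illustrative; the pragma, not the function, is the point):

```cpp
#include <cstddef>

// One pragma marks the hot loop for offload. The same source can be
// recompiled for a CPU, an NVIDIA GPU, or another accelerator --
// only the compiler and its flags change, not the code.
void saxpy(float a, const float* x, float* y, std::size_t n) {
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```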
OpenCL takes the opposite approach to CUDA - and presumably to the way Intel's API will be handled. Features are specified in the language by committee, and then the architectures are expected to implement them. I guess if you are lucky you can get the things you are working on in your architecture adopted into OpenCL, but that will still come along a lot more slowly than implementing them in your own language yourself. GPU architectures are changing so quickly at the moment that the OpenCL approach is backwards. There's no way OpenCL can keep up.
Amandtec - Monday, November 18, 2019 - link
The only possible way to beat CUDA (Nvidia) now is if MS, Intel, Google, and everyone else pool their resources together and heavily outspend Nvidia for a few years to develop something much better, thus warranting a code rewrite. Of course, they won't - and so they will repeat the lesson Intel learned with mobile and Microsoft learned with Windows Mobile: being late to market is the same as having an inferior product, because the standards and ecosystems (i.e. the software) have all already been built and people will not easily rebuild them.
Yojimbo - Monday, November 18, 2019 - link
I mostly agree, but there are some particulars to the current situation that mean there's a chance:
1) The basic architecture that serves the bulk of the lucrative and growing data center market can also serve the gaming, graphics, and HPC markets. These markets can be served without a thriving compute API, so there is the opportunity for the architecture to hang around for years and slowly build software support for the data center market. In addition, Intel is in a very good position to break into those other markets as long as they can make good products, owing to their client CPU and data center businesses. The Atom architecture and the Windows Mobile operating system were specialized endeavors with much less value if they didn't succeed in those single domains (although Intel also tried to use Atom in the Xeon Phi, which also failed).
2) AI is a possible entry point to the data center that, depending on how it ends up being used, may require relatively little software support to address a sizable revenue stream.
So Intel is fighting an uphill battle, but there are reasons to hope that, by executing well and using their large resource pool and huge market power, they may be able to break into the segment even at this late date. Of course, what they should have done was pursue a GPU years ago instead of fooling around with Xeon Phi. My personal gut feeling is that this OneAPI thing will end up being as much marketing gobbledygook as Intel's promise of using Xeon Phi to effectively accelerate existing x86 code. It doesn't matter if you're using "OneAPI" if you have to refactor your code to effectively target the different architectures anyway.
Yojimbo - Monday, November 18, 2019 - link
To follow on the OneAPI thing: yes, you can, for example, plug in a different function from a different library that has a common interface - one that targets a GPU and one that targets a CPU or an FPGA or the NNP. But the problem is that different algorithms are better suited to different architectures. When targeting a CPU, you may ideally want to tackle the problem entirely differently at the algorithmic level than when targeting a GPU.
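As a hypothetical sketch of that point (the interface and class names here are invented for illustration, not taken from any real oneAPI library): the call site stays the same, but each backend wants a genuinely different algorithm.

```cpp
#include <cstddef>

// Common interface: the call site never needs to change.
struct Reducer {
    virtual ~Reducer() = default;
    virtual double sum(const double* v, std::size_t n) const = 0;
};

// A CPU is happiest with a simple sequential (or SIMD) pass.
struct CpuReducer final : Reducer {
    double sum(const double* v, std::size_t n) const override {
        double s = 0.0;
        for (std::size_t i = 0; i < n; ++i) s += v[i];
        return s;
    }
};

// A GPU wants a tree reduction across thousands of threads: swapping
// the library function means swapping the algorithm, not just the target.
struct GpuReducer final : Reducer {
    double sum(const double* v, std::size_t n) const override {
        // ... launch a parallel tree-reduction kernel here ...
        (void)v; (void)n;
        return 0.0;  // placeholder
    }
};
```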
JayNor - Saturday, December 7, 2019 - link
Their oneAPI DPC++ uses SYCL along with Intel's extensions.
https://spec.oneapi.com/oneAPI/Elements/dpcpp/dpcp...
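One of those Intel extensions is unified shared memory (USM). A minimal sketch of what that looks like in DPC++, as described in the oneAPI spec (treat the exact API names here as an assumption about the 2019-era toolchain):

```cpp
#include <CL/sycl.hpp>

int main() {
    sycl::queue q;  // default device chosen by the runtime
    const size_t n = 1024;

    // USM: one pointer usable from both host and device, no buffers needed.
    float* data = sycl::malloc_shared<float>(n, q);
    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;

    // Shorthand parallel_for on the queue is itself a DPC++ extension.
    q.parallel_for(sycl::range<1>(n),
                   [=](sycl::id<1> i) { data[i] *= 2.0f; }).wait();

    sycl::free(data, q);
    return 0;
}
```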
edzieba - Tuesday, November 19, 2019 - link
It's not development they need to compete on with Nvidia, it's developer support. If you're using CUDA (or, for that matter, the consumer-side graphics drivers or GameWorks), Nvidia will bend over backwards to help you implement and optimise it. With OpenCL? AMD will happily point you to the source code and, if you ask very nicely, read you the documentation.
JayNor - Sunday, November 24, 2019 - link
The oneAPI OSPRay is already compatible with AMD CPUs.
https://www.phoronix.com/scan.php?page=article&...
What is not clear is whether the new GPU support will be easily ported.
"I’m pleased to share today that the Intel® Xe architecture roadmap for data center optimized rendering includes ray tracing hardware acceleration support for the Intel® Rendering Framework family of API’s and libraries."
https://itpeernetwork.intel.com/intel-rendering-fr...
JayNor - Sunday, November 24, 2019 - link
Intel's DPC++ is built on top of SYCL, according to the announcements. Their diagram shows a "SYCL for OpenCL" variant that apparently serves as a common interface to OpenCL implementations on FPGAs, DSPs, GPUs, and CPUs. Perhaps this is the direction of Intel OneAPI. In that case, portability to NVIDIA and AMD GPUs would depend on how well this SYCL for OpenCL handles the different OpenCL targets (a minimal sketch of SYCL code follows the links below).
https://www.khronos.org/sycl/
"Codeplay, which is already known for their several Vulkan / SYCL / SPIR-V initiatives, is working on this layer to run oneAPI / DPC++ / SYCL codes atop NVIDIA hardware while still leveraging NVIDIA's CUDA drivers."
https://www.phoronix.com/scan.php?page=news_item&a...
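For a sense of how "one source, many OpenCL targets" looks in practice, here is a minimal buffer-based sketch against the 2019-era SYCL 1.2.1 API (illustrative only; which devices are actually available depends on the OpenCL implementations installed):

```cpp
#include <CL/sycl.hpp>
#include <iostream>
#include <vector>

int main() {
    namespace sycl = cl::sycl;
    std::vector<float> v(1024, 1.0f);

    // The selector decides at runtime which OpenCL device backs the queue:
    // CPU, GPU, or another accelerator, depending on what is installed.
    sycl::queue q{sycl::default_selector{}};
    std::cout << "Running on: "
              << q.get_device().get_info<sycl::info::device::name>() << "\n";
    {
        sycl::buffer<float, 1> buf{v.data(), sycl::range<1>{v.size()}};
        q.submit([&](sycl::handler& h) {
            auto acc = buf.get_access<sycl::access::mode::read_write>(h);
            h.parallel_for<class scale_kernel>(
                sycl::range<1>{v.size()},
                [=](sycl::id<1> i) { acc[i] *= 2.0f; });
        });
    }  // buffer destruction waits for the kernel and writes results back
    return 0;
}
```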
Dragonstongue - Monday, November 18, 2019 - link
With how many hands Intel has in everyone's cookie jar, it would not surprise me in the least if this either falls flat on its face OR has immensely potent day-of-release support (with Intel forcing "partners" to make sure of it, lest they lose their "sweetheart deals" outright).
Phynaz - Monday, November 18, 2019 - link
Still better than what AMD has in the HPC space.
del42sa - Monday, November 18, 2019 - link
"Ponte Vecchio (PV for short) is just one part of Intel’s Xe plans, as we have already been promised GPUs on 10nm."yes, until now they were talking about 10nm. Now they skipped to 7nm.
Spunjji - Monday, November 18, 2019 - link
No solid information about the architecture, not even indications of how they think it will look... but it'll definitely be great!
Sure, Intel. Sure.
JayNor - Saturday, December 7, 2019 - link
"IC: Is Xe anything like Gen at a fundamental level?RK: At the heart of Xe, you will find many Gen features.
...
So we preserved lots of the good elements of Gen, but we had to get an order of magnitude increase in performance."
https://www.anandtech.com/show/15130/anandtech-exc...
Kevin G - Monday, November 18, 2019 - link
So far things look impressive on paper for the compute side of things. Intel is throwing a lot of silicon at performance, and they seem to be aiming for the performance crown. It's too early to say anything about power consumption or efficiency.
CXL based on PCIe 4.0 is going to be a good step forward for the internal interconnect. Ian correctly pointed out that the diagrams do not match what they said in terms of topology. At a high level, it looks like each chiplet has an external package link and then an unknown number of internal links. That'd be the simple way to scale things for a true mesh topology between packages. More information is clearly needed here.
What is disappointing is the lack of detail on the node-to-node fabric. Omni-Path is dead, and that could have been a late change in this design. Even without Omni-Path, I find it surprising that they aren't leveraging their silicon photonics technology to move that aspect of communication in-package as well. That'd be a straightforward way to enable 400 Gbit at scale for this system.
Santoval - Monday, November 18, 2019 - link
CXL runs on top of PCIe 5.0(+), not PCIe 4.0.