Original Link: https://www.anandtech.com/show/2698
OpenCL 1.0: The Road to Pervasive GPU Computing
by Derek Wilson on December 31, 2008 6:40 PM EST - Posted in GPUs
Earlier this month, the OpenCL specification was released by the Khronos Group. Khronos is made up of representatives from companies across the computing industry, and it focuses on creating and managing standards for graphics, multimedia and parallel computing on everything from mobile devices to desktop and workstation computers. Part of Khronos' charge is OpenGL and all its relatives with the Open- prefix, so the naming makes sense.
![](https://images.anandtech.com/reviews/video/opencl/khrlogo.png)
The goal of OpenCL is to make certain types of parallel programming easier and to provide vendor agnostic, hardware accelerated parallel execution of code. That's a bit of a mouthful, but the bottom line is that OpenCL will give developers a common set of easy-to-use tools to take advantage of any device with an OpenCL driver (processors, graphics cards, etc.) for the processing of parallel code.
While there are already tools available that enable parallel processing, these tools are largely dedicated to task parallel models. The task parallel model is built around the idea that parallelism can be extracted by constructing threads that each have their own goal or task to complete. While most parallel programming is task parallel, there is another form of parallelism that can greatly benefit from a different model.
In contrast to the task parallel model, data parallel programming runs the same block of code on hundreds (or thousands or millions or ...) of data points. Whereas my video game may have threads for handling AI, physics, audio, game state, rendering, and possibly more finely grained tasks if I'm up to the challenge, a data parallel program to do something like image processing may spawn millions of threads to do the processing on each pixel. The way these threads are actually grouped and handled will depend on both the way the program is written and the hardware the program is running on.
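To make the contrast concrete, here is a minimal sketch of what a data parallel kernel might look like in OpenCL C; the kernel and the grayscale operation are our own illustration, not anything taken from the spec:

```c
/* Sketch of an OpenCL C kernel: each work-item (thread) handles exactly
   one pixel, so a 1920x1080 image launches roughly two million of them. */
__kernel void grayscale(__global const uchar4 *src,
                        __global uchar4 *dst)
{
    int i = get_global_id(0);   /* which pixel does this thread own? */
    uchar4 p = src[i];
    /* integer approximation of luma: 0.299R + 0.587G + 0.114B */
    uchar luma = (uchar)((p.x * 77 + p.y * 150 + p.z * 29) >> 8);
    dst[i] = (uchar4)(luma, luma, luma, p.w);
}
```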
![](https://images.anandtech.com/reviews/video/opencl/khrpartner.png)
As we've said many times in the past, graphics is almost infinitely parallelizable. Millions of pixels on the screen can all act (mostly) independently of each other. Lightweight threads handle the calculation of everything that has to do with a particular pixel. As pixels get smaller and we pack more onto screens, there is more opportunity for parallel work. Graphics cards are currently the best data parallel processing engines we have available. And once OpenCL drivers are available, developers will have access to all that horsepower for any other data parallel tasks they see fit.
Now, it won't make sense to run a word processor on your graphics card, as there just isn't enough happening at once to take advantage of the hardware. Single threaded performance on a GPU isn't that great, especially compared to a general purpose CPU, and trying to run code that isn't massively parallel just isn't going to be a great idea. But there are plenty of things that can benefit from the GPU. Basically any multimedia processing can benefit, from video and audio decoding, editing, and encoding, to image manipulation, to helping speed up your math homework (brute force computation à la Maple, MATLAB, and Mathematica could certainly benefit from the GPU). There could be some interesting encryption and/or compression techniques born out of the data parallel approach as well.
The best applications of data parallel computing have likely not been seriously considered at this point, as it takes time to get from the availability of tools to the finished product, let alone the conception of ideas that have heretofore been precluded by the realities of parallel programming. But OpenCL isn't a miracle that will make everything speed up. Rather, it is a vehicle by which developers will be able to make a small subset of tasks orders of magnitude faster using hardware that is already in most people's computers. Which is certainly nice. But let's take a closer look.
Why is Parallel Computing Hard?
There are plenty of issues with parallel programming. Breaking up the problem is often the most important and complex step, especially when the parallelism is not obvious. As we are rooted in a world of sequential programming, conceptualizing the parallelization of tasks that lend themselves to sequential programming is tough. This can require not only the reworking of code, but redesigning the entire process of solving a problem.
Even in problems that lend themselves to parallelism, exploiting the parallelism can be tough. Even if you know the best and fastest algorithm for solving a data parallel problem, it isn't always possible to translate that to an efficient program. For instance, if I want to multiply two matrices with 100k x 100k dimensions, I can't just spawn all the threads I would need. If I were using POSIX threads to calculate one cell of the result matrix each, I would spend more time creating threads and allocating resources than actually doing the computation. I've got to take the resources I have and use them to the best of my ability. Though I can do matrix multiplication in parallel, I have to be careful about how I break up the problem, and I can't exploit all the parallelism possible because of the tools I normally work with.
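As a rough sketch of what that careful breakup looks like in practice (our own example, with error handling omitted), a CPU-bound version has to fold the work down onto a handful of POSIX threads, each handling a band of rows rather than a single cell:

```c
#include <pthread.h>

#define N        1024   /* matrix dimension (far smaller than 100k!) */
#define NTHREADS 4      /* roughly one thread per core, not one per cell */

static float A[N][N], B[N][N], C[N][N];

/* Each thread computes a contiguous band of rows; spawning N*N threads,
   one per output cell, would cost more than the math itself. */
static void *multiply_rows(void *arg)
{
    long t = (long)arg;
    for (int i = t * (N / NTHREADS); i < (t + 1) * (N / NTHREADS); i++)
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&threads[t], NULL, multiply_rows, (void *)t);
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(threads[t], NULL);
    return 0;
}
```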
We are also limited in terms of hardware resources. With only a few processors available for general purpose programming, even if the software overhead weren't an issue, we couldn't actually get any speedup from parallelizing beyond a certain point. This means not only that we can't exploit abundant parallelism even when an algorithm lends itself to it, but also that programmers are discouraged from thinking in terms of parallelism in the first place.
How Does OpenCL Help?
What if we had a pool of hardware resources hundreds wide that could handle thousands of threads in flight at a time with virtually no software overhead? Well, we do: it's called a GPU. And if we could use the GPU for processing, then we could spawn a bunch of threads and really chew through the matrix multiplication we talked about earlier (or whatever). We might still have to be concerned about how many hardware resources we have in order to best map the problem to the specific device in the system. And we still have the problem of actually spawning, managing and running threads on the GPU hardware.
But what if we could write a special function, called a kernel, that could instantly be spawned hundreds or thousands or millions of times and run on different data, all without our having to create and manage the threads ourselves? And what if we didn't need to worry about how to break up our problem, leaving the allocation of threads to hardware up to the runtime? Well, now we have a solution: that's OpenCL.
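In OpenCL, the matrix multiplication discussed earlier reduces to a kernel in which each work-item computes exactly one cell of the result; how those millions of logical threads map onto the hardware is the runtime's problem, not ours. A sketch (again our own illustration, not code from the spec):

```c
/* Sketch: one work-item per output cell. For a large result matrix this
   is millions (or billions) of logical threads, all scheduled by the
   OpenCL runtime rather than created and managed by the programmer. */
__kernel void matmul(__global const float *a,
                     __global const float *b,
                     __global float *c,
                     const int n)
{
    int row = get_global_id(1);
    int col = get_global_id(0);
    float sum = 0.0f;
    for (int k = 0; k < n; k++)
        sum += a[row * n + k] * b[k * n + col];
    c[row * n + col] = sum;
}
```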
The GPU is the vehicle for exploiting data parallelism. But until now, our vehicle has run like a train on a track called real-time 3D graphics acceleration. OpenCL removes the track and its limitations and builds in a steering wheel developers can use to take the GPU (and other parallel devices) anywhere a programmer can imagine.
Open, Closed, Proprietary ... Sorting out the Confusion
Over the past few months, we've seen plenty of confusion over the direction NVIDIA and AMD are taking with respect to GPU computing. This isn't helped by either AMD or NVIDIA, as both tend to tout the advantages of their own approach and the disadvantages of the other guy's take on it.
AMD and its supporters tend to claim that NVIDIA's CUDA is not optimal because it is not an open standard, and that AMD supports openness because its solution (Brook+) is open source. But Brook+ isn't an open standard either: it was developed at Stanford University and hasn't been standardized. While the source for the Brook+ compiler is available, it would take a large investment to retool it for NVIDIA hardware, and even then you'd need to build different versions of a program for AMD and NVIDIA platforms. The original GPGPU-oriented Brook is a different story, as it generated OpenGL code to do the GPGPU work, but modifying it to generate CAL code makes it neither interoperable nor particularly open or standard, at least as those terms are used when talking about languages, APIs and interoperability.
NVIDIA isn't much better though. They tend to act like anything AMD does is a copy of their efforts and amounts to nothing because CUDA for C is the gold standard for GPU computing and AMD doesn't have it, which just isn't the case. In fact, AMD started demonstrating concerted efforts to advance GPU computing before we saw anything from NVIDIA, and in much more interesting ways.
With R580, AMD (then ATI) actually published part of its ISA and called the initiative CTM (for Close to Metal). Before we had a beta version of CUDA, we had Folding@home GPU accelerated on R520 and R580. Beyond that, CUDA for C has done really well in the HPC (high performance computing) space, but it hasn't caught on in the consumer space. Neither AMD nor NVIDIA has a viable consumer oriented solution for GPU computing.
So NVIDIA has the HPC market with CUDA and has gotten some universities to start teaching data parallel programming using CUDA for C. AMD could invest in the CUDA for C language and create its own compiler (nothing is stopping them), but then you still have the same interoperability problem as if NVIDIA implemented Brook+. If NVIDIA or AMD want their solution to work with the other guy's hardware, they would need to write a wrapper to translate CAL to PTX or PTX to CAL. Or we could go a different direction and work on building an industry standard virtual ISA for data parallel architectures, but I doubt that effort would ever take off.
So the bottom line is that both AMD and NVIDIA support a proprietary solution (Brook+ and CUDA for C, respectively) as well as the open standard (OpenCL). There are further differences between Brook+ and CUDA, but the important point is that these proprietary solutions are never going to produce one binary that runs on both AMD and NVIDIA hardware, both because of the approaches used and because AMD and NVIDIA aren't going to work closely enough to make something like that happen, at least in the foreseeable future.
OpenCL, on the other hand, offers developers the ability to write an application once, compile it once, and expect it to run on all major GPU hardware. Something that could never happen with either CUDA or Brook+.
Why NVIDIA Thinks CUDA for C and Brook+ Are Viable Alternatives
While OpenCL is a high level API, it does require the programmer to perform certain tasks that don't have much to do with the parallel algorithm being implemented. OpenCL devices in the system need to be found and set up to properly handle the task at hand. This requires a lot of overhead, like creation of a context, device selection, creation of the command queue(s), management of buffers for supplying and collecting data on the OpenCL device, and dynamic compilation of OpenCL kernels within the program. This is all in addition to writing kernels (data parallel functions) and actually using them in a program that does useful work.
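To give a feel for how much of this is plumbing, here is a hedged sketch of the host-side setup for the hypothetical matmul kernel shown earlier, with all error checking omitted; the calls follow the OpenCL 1.0 API, but the function and variable names are our own:

```c
#include <CL/cl.h>

/* Sketch of the setup "overhead" described above, for a single GPU
 * device. kernel_src holds the OpenCL C source for a kernel named
 * "matmul" that computes one output cell per work-item. */
void run_matmul(const char *kernel_src, const float *host_a,
                const float *host_b, float *host_c, size_t n)
{
    cl_int err;
    size_t bytes = n * n * sizeof(float);

    /* Device selection: grab a platform, then a GPU on that platform. */
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    /* Context and command queue creation. */
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

    /* Dynamic compilation of the kernel source within the program. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kernel_src, NULL, &err);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "matmul", &err);

    /* Buffer management: supply the inputs, reserve room for the output. */
    cl_mem a = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                              bytes, (void *)host_a, &err);
    cl_mem b = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                              bytes, (void *)host_b, &err);
    cl_mem c = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, NULL, &err);

    /* Only now: bind arguments, launch n*n work-items, read results back. */
    cl_int dim = (cl_int)n;
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &a);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &b);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &c);
    clSetKernelArg(kernel, 3, sizeof(cl_int), &dim);
    size_t global[2] = { n, n };
    clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, c, CL_TRUE, 0, bytes, host_c, 0, NULL, NULL);
}
```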
The overhead and management work required is similar to what goes on with OpenGL. This makes sense considering that both use GPUs, that they can share data with each other, and that the same standards body that manages OpenGL now manages OpenCL. But the fact remains that this type of overhead is cumbersome and can be a real headache for anyone more interested in the algorithm, like scientists working on HPC code who usually know the theory much better than the programming.
Both Brook+ and CUDA for C hide the complexity of setting up the hardware by allowing the driver to handle the details. This allows developers to write a kernel, use it, and, for the most part, forget about what's actually going on in the hardware. Going with something like this first was a good move for both NVIDIA and AMD, as it allows developers to get familiar with the type of programming they will be doing in the future for data parallel problems without taking on more complexity than necessary.
NVIDIA, for one, believes a language extension as opposed to an API like OpenCL has major benefits and will always have a place in GPU computing (and especially in the HPC space where scientists don't want to be programmers any more than they need to). When asked if they would submit their language to a standards body, NVIDIA said that was highly unlikely as there are other language efforts out there and NVIDIA has been advancing CUDA for C much more rapidly than a standards body would.
On the flip side, putting more control in the hands of the developer can result in better, faster code. There is a bit of a "black box" feeling to these solutions: you put code in and get results out, but you can't be sure what goes on in the middle to make it happen. OpenCL gives you a better ability to fine tune the software and make sure that exactly what you want to happen happens. Despite NVIDIA's assertion that scientists interested in coding for HPC solutions will have a better experience with CUDA, the cost/benefit of ultra-fine tuning code for HPC machines leans heavily in favor of spending the time and money on optimizing. This means that OpenCL will likely be the choice for performance sensitive HPC applications, while CUDA for C and Brook+ will likely have more of a place in trying out ideas before settling on a final direction.
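As a small example of the kind of control involved, an OpenCL host program can either let the runtime choose how work-items are grouped or pin the work-group size down explicitly when tuning for a particular device; this sketch reuses the hypothetical queue and kernel from the earlier example:

```c
size_t global[2] = { 1024, 1024 };

/* Hands-off: pass NULL and let the runtime pick the work-group size. */
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, NULL, 0, NULL, NULL);

/* Fine tuned: force 16x16 work-groups sized to suit a specific device. */
size_t local[2] = { 16, 16 };
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, NULL);
```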
So there you have it. OpenCL will enable applications in the consumer space to take advantage of data parallel hardware, while Brook+ and CUDA may still have a place in the industry as well (but not on the consumer side of things). That is, until some other, more popular standard data parallel language extension comes along and pushes both CUDA for C and Brook+ out of the market.
OpenCL Extending OpenGL
OpenGL 3.0 was a disappointment to game developers who hoped the API would add some key features that ended up being left behind. With the latest release, Khronos relegated OpenGL to professional and workstation applications like CAD/CAM and 3D content creation software, foregoing the wants and desires of game programmers. While not ideal from our perspective (competition is always good), the move is understandable, as OpenGL hasn't been consistently used by any major game engine developer other than id Software for quite some time. DirectX is seen as the graphics API of choice for game programming, and it looks like it will remain that way for the foreseeable future.
But OpenCL does bring an interesting element to the table. One of the major advancements of DirectX 11 will be the addition of a compute shader to the pipeline. This compute shader will be general purpose and capable of operating on diverse data structures that pixel shaders are not geared towards. It will be capable of much of what OpenCL is, though it will be tuned and geared toward doing so in the context of graphics. It is, after all, still DirectX. In DX11, the pixel shader and compute shader will share data via data structures rather than through any sort of formal input/output mechanism. Because of this high level of integration, game developers (and other graphics engine developers) will be able to tightly combine current techniques with more general purpose code that can handle a broader array of algorithms.
OpenGL doesn't have anything like this in the works, but OpenCL fixes that. OpenCL is capable of sharing data with OpenGL, and we aren't just talking about conveniently copying data back and forth: we are talking about physically sharing data structures and memory locations. This essentially adds a compute shader to OpenGL for those who want it. Why is that the case? Well, offering OpenCL users a means of using OpenGL images and buffers as OpenCL images and buffers means that OpenGL and OpenCL can share data with no copy or conversion overhead. Not only are OpenGL and OpenCL able to work on the same data; the method by which they communicate is very similar to the way DX11 allows data to pass between the pixel and compute shaders.
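In practice, the mechanism looks something like the following sketch, which assumes an OpenCL context created with GL sharing enabled, an existing OpenGL buffer object `vbo`, and a kernel that writes vertex data; the interop calls are real OpenCL/OpenGL functions, but the surrounding names are our own:

```c
#include <CL/cl_gl.h>

cl_int err;

/* Wrap an existing OpenGL buffer object as an OpenCL buffer: both APIs
   now refer to the same memory, so there is no copy or conversion. */
cl_mem cl_vbo = clCreateFromGLBuffer(ctx, CL_MEM_READ_WRITE, vbo, &err);

/* OpenCL must acquire the shared object before touching it (any pending
   OpenGL work on it should be finished first, e.g. via glFinish)... */
clEnqueueAcquireGLObjects(queue, 1, &cl_vbo, 0, NULL, NULL);

/* ...run a kernel that writes vertex data directly into the GL buffer... */
clSetKernelArg(kernel, 0, sizeof(cl_mem), &cl_vbo);
size_t global = num_vertices;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);

/* ...then release it so OpenGL can render the updated data immediately. */
clEnqueueReleaseGLObjects(queue, 1, &cl_vbo, 0, NULL, NULL);
```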
While game developers may be intrigued, professional app developers may have more reason to get excited. Sure, this will allow OpenGL game developers to use a compute shader like option, but it gives professional application developers the ability to actually combine the real work of simulation or data manipulation with visualization. With double precision available on hardware that supports it, this could be useful for applications where a lot of real work needs to be done both on the thing being visualized and on the visualization itself. This could speed things up quite a bit and allow fluid realtime visualization and manipulation of much more complicated data sets.
Additionally, this compute shader will work on hardware not specifically designed as DX11 class hardware. DX11, as a strict superset of DX10, will extend some functionality to DX10 hardware, but we aren't yet certain about the specifics, and they may include CS functionality. On top of this, OpenCL should get drivers in the first quarter of next year. This puts the combination of OpenGL 3.0 plus OpenCL 1.0, for the first time in a long time, ahead of DirectX in terms of technology and capability. This is by no means the result of any initiative from the sluggish and non-innovative OpenGL ARB, but maybe it will inspire more use of OpenGL, which in turn might inspire more innovation from the ARB. I'm not going to hold my breath on that one, though.
In any case, the fact that OpenGL and OpenCL can share data without requiring a copy or conversion is a key feature. Not only will OpenCL allow developers to use the GPU for general purpose computing, but using OpenCL with OpenGL will help build a bridge between data parallel computing and visualization. Existing solutions like CUDA and Brook+ haven't done very well in this area, and using OpenGL or DirectX for data parallel processing makes it difficult to get work done efficiently. OpenCL + OpenGL solves these problems.
And maybe we'll even see things go the other way as well. Maybe developers doing massive amounts of parallel data processing with OpenCL, who were never before interested in "seeing" what's happening, will find it easy and beneficial to enable advanced visualization of their data, or of the processing itself, through integration with OpenGL. However they are used together, OpenCL and OpenGL will definitely both benefit from their symbiotic relationship.
Final Words
Both AMD and NVIDIA have touted the fact that they will support OpenCL as soon as they are able. Even though the specification has been released, it is not yet possible to claim OpenCL support because there are not yet any conformance tests: NVIDIA and AMD will need to correctly compile and execute OpenCL code and programs, and match results for calculations within certain tolerances. OpenCL drivers should start trickling out some time next quarter. Until then, developers do have access to the specification and header files, so they can start playing with it as well.
Unfortunately, even if we had final drivers today, we would have to wait quite some time before the first real apps trickle out. We do expect a higher volume of consumer level applications than we've seen with CUDA, as there is greater incentive to develop using OpenCL. The fact that the vast majority of modern graphics cards will support OpenCL, and that the vast majority of computers have modern graphics cards installed, means that once OpenCL drivers arrive, developers will instantly have standardized and easy access to hundreds of times more compute power for general purpose processing of data parallel algorithms.
While AMD and NVIDIA will likely carry on their efforts with ATI Stream and CUDA, unless and until there is a language that can target all GPUs, we are more likely to see OpenCL thrive. No matter how much easier it might be to leave all the overhead and management to the system or the driver, putting the power in the hands of the developer will always enable higher performance and more innovative usage of the hardware.