Name: Answered by the Experts: Heterogeneous and GPU Compute with AMD’s Manju Hegde
Author: Anand Lal Shimpi

Original Link: https://www.anandtech.com/show/5847/answered-by-the-experts-heterogeneous-and-gpu-compute-with-amds-manju-hegde

Answered by the Experts: Heterogeneous and GPU Compute with AMD’s Manju Hegde


by Anand Lal Shimpi on May 21, 2012 12:58 PM EST


AMD’s Manju Hegde is one of the rare folks I get to interact with who has an extensive background working at both AMD and NVIDIA. He was one of the co-founders and CEO of Ageia, a company that originally tried to bring higher quality physics simulation to desktop PCs in the mid-2000s. In 2008, NVIDIA acquired Ageia and Manju went along, becoming NVIDIA’s VP of CUDA Technical Marketing. The CUDA fit was a natural one for Manju as he spent the previous three years working on non-graphics workloads for highly parallel processors. Two years later, Manju made his way to AMD to continue his vision for heterogeneous compute work on GPUs. His current role is as the Corporate VP of Heterogeneous Applications and Developer Solutions at AMD.

Given what we know about the new AMD and its goal of building a Heterogeneous Systems Architecture (HSA), Manju’s position is quite important. For those of you who don’t remember back to AMD’s 2012 Financial Analyst Day, the formalized AMD strategy is to exploit its GPU advantages on the APU front in as many markets as possible. AMD has a significant GPU performance advantage compared to Intel, but in order to capitalize on that it needs developer support for heterogeneous compute. A major struggle everyone in the GPGPU space faced was enabling applications that took advantage of the incredible horsepower these processors offered. With AMD’s strategy closely married to doing more (but not all, hence the heterogeneous prefix) compute on the GPU, it needs to succeed where others have failed.

The hardware strategy is clear: don’t just build discrete CPUs and GPUs, but instead transition to APUs. This is nothing new as both AMD and Intel were headed in this direction for years. Where AMD sets itself apart is that it is willing to dedicate more transistors to the GPU than Intel. The CPU and GPU are treated almost as equal class citizens on AMD APUs, at least when it comes to die area.

The software strategy is what AMD is working on now. AMD’s Fusion¹² Developer Summit (AFDS), in its second year, is where developers can go to learn more about AMD’s heterogeneous compute platform and strategy. Why would a developer attend? AMD argues that the speedups offered by heterogeneous compute can be substantial enough that they could enable new features, usage models or experiences that wouldn’t otherwise be possible. In other words, taking advantage of heterogeneous compute can enable differentiation for a developer.

In advance of this year’s AFDS, Manju agreed to directly answer your questions about heterogeneous compute, where the industry is headed and anything else AMD will be covering at AFDS. Today we have those answers.

How Fusion and HSA Differ

First and foremost, AMD has been down the GPGPU path before with Fusion. Can you explain how HSA is different?

Existing APIs for GPGPU are not the easiest to use and have not had widespread adoption by mainstream programmers. In HSA we have taken a look at all the issues in programming GPUs that have hindered mainstream adoption of heterogeneous compute and changed the hardware architecture to address those. In fact the goal of HSA is to make the GPU in the APU a first class programmable processor as easy to program as today's CPUs. In particular, HSA incorporates critical hardware features which accomplish the following:

1. GPU Compute C++ support: This gives heterogeneous compute access to many of the programming constructs that only CPU programmers can use today

2. HSA Memory Management Unit: This makes all system memory accessible to both the CPU and the GPU, depending on need. In today's world, only a subset of system memory can be used by the GPU.

3. Unified Address Space for CPU and GPU: The unified address space provides ease of programming for developers to create applications. By not requiring separate memory pointers for CPU and GPU, libraries can simplify their interfaces

4. GPU uses pageable system memory via CPU pointers: This is the first time the GPU can take advantage of the CPU virtual address space. With pageable system memory, the GPU can reference the data directly in the CPU domain. In all prior generations, data had to be copied between the two spaces or page-locked prior to use

5. Fully coherent memory between CPU & GPU: This allows for data to be cached in the CPU or the GPU, and referenced by either. In all previous generations GPU caches had to be flushed at command buffer boundaries prior to CPU access. And unlike discrete GPUs, the CPU and GPU share a high speed coherent bus

6. GPU compute context switch and GPU graphics pre-emption: GPU tasks can be context switched, making the GPU in the APU a multi-tasker. Context switching means faster application, graphics and compute interoperation. Users get a snappier, more interactive experience. As UIs are becoming increasingly touch focused, it is critical for applications trying to respond to touch input to get access to the GPU with the lowest latency possible to give users immediate feedback on their interactions. With context switching and pre-emption, time criticality is added to the tasks assigned to the processors. Direct access to the hardware for multiple users or multiple applications is either prioritized or equalized
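The difference between the copy-based model and the shared-pointer model described in items 3 through 5 can be loosely illustrated in plain Python. This is only an analogy and not HSA code: a `bytearray` stands in for system memory, an explicit copy stands in for the staging that earlier GPU programming models required, and a `memoryview` stands in for a pointer both processors can dereference. All function names here are hypothetical.

```python
def scale_with_copies(host_data: bytearray, factor: int) -> bytearray:
    """Copy-in / copy-out model: the 'device' works on a private copy."""
    device_copy = bytearray(host_data)            # copy host -> device
    for i in range(len(device_copy)):
        device_copy[i] = (device_copy[i] * factor) % 256
    return bytearray(device_copy)                 # copy device -> host


def scale_in_place(shared: memoryview, factor: int) -> None:
    """Shared-address-space model: the 'device' writes through the same buffer."""
    for i in range(len(shared)):
        shared[i] = (shared[i] * factor) % 256


host = bytearray([1, 2, 3, 4])

# Copy model: two extra copies, and the original buffer is left untouched.
copied_result = scale_with_copies(host, 3)
host_after_copy_model = bytes(host)

# Shared model: no copies, the host buffer itself is updated.
scale_in_place(memoryview(host), 3)
```

In the copy model the caller has to manage two buffers and keep them in sync. In the shared model a library interface only needs one pointer, which is the simplification item 3 refers to.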

As a result, HSA is a purpose designed architecture to enable the software ecosystem to combine and exploit the complementary capabilities of CPUs (sequential programming) and GPUs (parallel processing) to deliver new capabilities to users that go beyond the traditional usage scenarios. It may be the first time a processor company has made such significant investment primarily to improve ease of programming!

In addition, on an HSA architecture the application codes to the hardware, which enables user mode queueing, hardware scheduling, much lower dispatch times and fewer memory operations. We eliminate memory copies, reduce dispatch overhead, eliminate unnecessary driver code, eliminate cache flushes, and enable the GPU to be applied to new workloads. We have done extensive analysis on several workloads and have obtained significant performance per joule gains for workloads such as face detection, image stabilization, gesture recognition etc…

Finally, AMD has stated from the beginning that our intention is to make HSA an open standard, and we have been working with several industry partners who share our vision for the industry and share our commitment to making this easy form of heterogeneous computing become prevalent in the industry. While I can't get into specifics at this time, expect to hear more about this in a few weeks at the AMD Fusion Developer Summit (AFDS).

So you see why HSA is different and why we are excited :)

I haz questions by B3an

Could an OS use GPU compute in the future to speed up everyday tasks, apart from the usual stuff like the UI? What possible tasks would this be? And is it possible we'll see this happen within the next few years?

Yes, definitely. OSes are moving towards providing some base functionality in terms of security, voice recognition, face detection, biometrics, gesture recognition, authentication, some core database functionality. All these benefit significantly from the optimizations in HSA described above. With the industry support we are building this should happen in the next few years.

Are you excited about Microsoft's C++ Accelerated Massive Parallelism (AMP)? Do you think we'll see a lot more software using GPU compute now that Visual Studio 11 will include C++ AMP support?

We see C++ AMP as a great alternative to OpenCL. Both OpenCL and C++ AMP provide a method for utilizing the underlying GPU compute infrastructure, each with its own benefits. AMD realizes that different classes of programmers may have different language preferences, so we will support both languages with the same level of quality, in order to serve our developer community better. C++ AMP is a small delta to C++ and as such will appeal to many mainstream programmers and, with Microsoft's support, be able to reach a vast audience. So yes, we expect to see a lot more software using heterogeneous compute through this direction.

Do you expect the next gen consoles to make far more use of GPU compute?

Cannot comment further on this since these are products being brought forward by other companies.

Question by mrdude

The recent Kepler design has shown that there might be a chasm developing between how AMD and nVidia treat desktop GPUs. While GCN showed that it can deliver both fantastic compute performance (particularly on supported openCL tasks), it also weighs in heavier than Kepler and lags behind in terms of gaming performance. The added vRAM, bus width and die space for the 7970 allow for greater compute performance but at a higher cost; is this the road ahead and will this divide only further broaden as AMD pushes ahead? I guess what I'm asking is: Can AMD provide both great gaming performance as well as compute without having to sacrifice by increasing the overall price and complexity of the GPU?

Yes. You will see our future products continue to balance great gaming and compute performance.

It seems to me that HSA is going to require a complete turnaround for AMD as far as how they approach developers. Personally speaking, I've always thought of AMD as the engineers in the background who did very little to reach out and work with developers, but now in order to leverage the GPU as a compute tool in tasks other than gaming it's going to require a lot of cooperation with developers who are willing to put in the extra work. How is AMD going about this? and what apps will we see being transitioned into GPGPU in the near future?

Things have changed :) AMD has a team focused on developer solutions and outreach. This team drives the definition and deployment of tools, libraries, SDKs to the developer ecosystem including enablement content such as blogs, white papers and webinars. In addition, AMD works with key developers and also contributes to prominent open source code bases to promote GPU compute. The launch of the 2nd Generation AMD A-Series "Trinity" APU includes numerous applications that use the GPU for compute acceleration – Photoshop CS6, WinZip, x264/Handbrake, GIMP to name a prominent few. There are more plans in the works to reach out to developers and make it easy for them to extract the most from HSA platforms.

Offloading FP-related tasks to the GPU seems like a natural transition for a type of hardware that already excels in such tasks, but was HSA partly the reason for the single FPU in a Bulldozer module compared to the 2 ALUs?

We constantly evaluate the tradeoff of where to add compute execution resources. It is more expensive to add more computation resources in the CPU core since CPU vector execution resources are typically clocked higher, have multi-ported register files, support out-of-order execution for latency hiding, etc. That said, the Bulldozer FPU does include support for new FMAC instructions and a higher clock rate. So we really are investing in both CPU and GPU.

Is AMD planning to transition into an 'All APU' lineup for the future, from embedded to mobile to desktop and server?

AMD is all about meeting customer needs. We already have APUs for embedded, mobile (tablet and notebook) and desktop, and will address APUs for server as we continue to monitor what the market and customer needs are. In fact, we already have some partners incorporating APUs into server designs, one of which is Penguin Computing who is keynoting AFDS…

OpenCL by A5

What is AMD doing to make OpenCL more pleasant to work with?

Some of the initiatives that AMD has already driven are:
- Improved Debugger and Profiler (Visual Studio plugin, Standalone Eclipse, Linux)
- Static C++ interface (APP SDK 2.7)
- Extended tools thru MCW (PPA, GMAC, TM)
- OpenCL Book, Programming Guide (US, India, China editions)
- University course kit
- Webinars (Various topics)
- Online self-training material
- Hands-on tutorial, content in AFDS
- Moderated OpenCL forum
- OpenCL Training and Services Partners
- OpenCL acceleration of major Open Source codebases
- Aparapi to enable and ease Java users in using OpenCL

Questions by ltcommanderdata

WinZip and AMD have been promoting their joint venture in implementing OpenCL hardware accelerated compression and decompression. Owning an AMD GPU I appreciate it. However, it's been reported that WinZip's OpenCL acceleration only works on AMD GPUs. What is the reasoning behind this? Isn't it hypocritical, given AMD's previous stance against proprietary APIs, namely CUDA, that AMD would then support development of a vendor specific OpenCL program?

The OpenCL version of WinZip has been optimized on AMD GPUs and achieves significant performance gains. While OpenCL is not vendor specific, optimizations in any application are essentially vendor specific since they depend on the microarchitecture of each vendor. We worked closely with WinZip to get these optimizations in. We have the most mature OpenCL implementation currently and even then we just managed to get the QA done before WinZip's launch date. I am sure that they will be coming out with OpenCL optimized versions on Intel and Nvidia soon (you should ask them). That is in fact the beauty of OpenCL – one code base gives functional portability across vendor platforms and optimizations are the only components that need to be scheduled. So yes, this is in line with our embrace of open and industry standards.

This may be related to the above situation. Even with standardized, cross-platform, cross-vendor APIs like OpenCL, to get the best performance developers would need to do vendor-specific optimizations, and even optimizations specific to a device generation within a vendor. Is there anything that can be done, whether at the API level, at the driver level or at the hardware level, to achieve the write-once, run-well-anywhere ideal?

Device specific optimizations will always have a beneficial impact on performance. This is true even with CPUs. While differences between GPUs are more dramatic, this is due to the fact that today's GPUs are designed to excel at graphics while compute is a secondary consideration. Reluctance to spend more chip area on compute results in having many device specific performance "cliffs". For example, VLIW instructions, 64 thread wavefronts, and the need for coalesced accesses to memory. As GPUs are increasingly used for compute, and as it becomes possible to add yet more transistors, these "cliffs" will continue to disappear. Advances in compilers will also help.
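One of the "cliffs" mentioned above, the 64-thread wavefront, can be sketched with a few lines of arithmetic. The model below is a deliberate simplification and rests on two assumptions not stated in the answer: each work item occupies one lane, and a partially filled wavefront still occupies all 64 lanes. The function names are hypothetical.

```python
import math

WAVEFRONT_SIZE = 64  # threads per wavefront, as cited in the answer above


def wavefronts_needed(work_items: int, wavefront_size: int = WAVEFRONT_SIZE) -> int:
    """Number of wavefronts required to cover all work items."""
    if work_items <= 0:
        raise ValueError("work_items must be positive")
    return math.ceil(work_items / wavefront_size)


def lane_utilization(work_items: int, wavefront_size: int = WAVEFRONT_SIZE) -> float:
    """Fraction of allocated lanes that do useful work (simplified model)."""
    lanes = wavefronts_needed(work_items, wavefront_size) * wavefront_size
    return work_items / lanes


for n in (64, 65, 128, 129):
    print(n, wavefronts_needed(n), round(lane_utilization(n), 3))
```

Under these assumptions, going from 64 to 65 work items doubles the number of wavefronts while roughly halving lane utilization, which is the kind of step change a developer has to tune around on a specific device.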

Comparing the current implementations of on-die GPUs, namely AMD Llano and Intel Sandy Bridge/Ivy Bridge, it appears that Intel's GPU is more tightly integrated with CPU and GPU sharing the last level cache for example. Admittedly, I don't believe CPU/GPU data sharing is exposed to developers yet and only available to Intel's driver team for multimedia operations. Still, what are the advantages and disadvantages of allowing CPUs and GPUs to share/mix data? I believe memory coherency is a concern. Is data sharing the direction that things are eventually going to be headed?

For some workloads we expect data sharing between the CPU and GPU. In many cases the data being shared is quite large – for example a single frame of HD video with 4 bytes/pixel is about 8MB of data, and many algorithms deal with multiple frames of video, so even seemingly large shared caches are not effective at capturing real-world working sets. We see clear advantages from CPU/GPU shared address spaces (same page tables) and high-bandwidth memory access from both devices.
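The 8MB figure can be checked with a short calculation. The sketch below assumes "HD" means 1920x1080, which the answer does not state, and the 8 MiB cache size is a hypothetical value chosen only to show how quickly a multi-frame working set outgrows a cache.

```python
def frame_bytes(width: int, height: int, bytes_per_pixel: int = 4) -> int:
    """Size of one uncompressed video frame in bytes."""
    return width * height * bytes_per_pixel


def frames_that_fit(cache_bytes: int, one_frame_bytes: int) -> int:
    """How many whole frames fit in a cache of the given size."""
    return cache_bytes // one_frame_bytes


FULL_HD = (1920, 1080)                       # assumption: "HD" taken as 1080p
HYPOTHETICAL_CACHE_BYTES = 8 * 1024 * 1024   # assumption: an 8 MiB shared cache

one_frame = frame_bytes(*FULL_HD)            # 1920 * 1080 * 4 = 8,294,400 bytes
fits = frames_that_fit(HYPOTHETICAL_CACHE_BYTES, one_frame)

print(f"one frame: {one_frame} bytes ({one_frame / 1e6:.1f} MB)")
print(f"whole frames that fit in the hypothetical cache: {fits}")
```

With these assumptions a single frame is about 8.3 MB, so an algorithm that touches even two frames already exceeds the hypothetical cache.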

As a follow up, it looks like the just released Trinity brings improved CPU/GPU data sharing as per question 3 above. Maybe you could compare and contrast Trinity and Ivy Bridge's approach to data sharing and give an idea of future directions in this area?

See above.

Related to the above, how much is CPU<>GPU communications a limitation for current GPGPU tasks? If this is a significant bottleneck, then tightly integrated on-die CPU/GPUs definitely show their worth. However, the amount of die space that can be devoted to an IGP is obviously more limited than what can be done with a discrete GPU. What can then be done to make sure the larger computational capacity of discrete GPUs isn't wasted doing data transfers? Is PCIe 3.0 sufficient? I don't remember if memory coherency was adopted for the final PCIe 3.0 spec, but would a new higher speed bus, dedicated to coherent memory transfers between the CPU and discrete GPU be needed?

For some applications, the CPU/GPU communication is so severe a limitation that it eliminates the gains from using the GPU. (For other algorithms, the communication is small or can be overlapped and the GPU can be used quite effectively). PCIe 3.0 helps for large-block data transfers. Inherently, though, discrete GPUs will continue to provide higher peak computation capabilities (since the entire die is dedicated to the GPU) but will be less tightly integrated than what can be achieved with an APU.
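A rough cost model shows how transfers can cancel out a GPU's compute advantage. Everything numeric below is a hypothetical placeholder rather than a measurement of any real product, and the model assumes transfers and compute do not overlap, which the answer notes is not always the case.

```python
def offload_seconds(cpu_seconds: float, kernel_speedup: float,
                    bytes_moved: float, link_bytes_per_second: float) -> float:
    """Time on the GPU when data must first cross a CPU-GPU link."""
    transfer_seconds = bytes_moved / link_bytes_per_second
    compute_seconds = cpu_seconds / kernel_speedup
    return transfer_seconds + compute_seconds   # assumes no overlap


def offload_wins(cpu_seconds: float, kernel_speedup: float,
                 bytes_moved: float, link_bytes_per_second: float) -> bool:
    """True if offloading beats simply running on the CPU."""
    return offload_seconds(cpu_seconds, kernel_speedup,
                           bytes_moved, link_bytes_per_second) < cpu_seconds


# Hypothetical workload: 10 ms on the CPU, 10x faster kernel on the GPU,
# and a link that moves 8e9 bytes per second.
CPU_SECONDS = 0.010
KERNEL_SPEEDUP = 10.0
LINK_BYTES_PER_SECOND = 8e9

small_transfer_wins = offload_wins(CPU_SECONDS, KERNEL_SPEEDUP, 10e6, LINK_BYTES_PER_SECOND)
large_transfer_wins = offload_wins(CPU_SECONDS, KERNEL_SPEEDUP, 100e6, LINK_BYTES_PER_SECOND)
```

With these placeholder numbers, moving 10 MB still leaves the GPU ahead, while moving 100 MB makes the offloaded version slower than the CPU despite the 10x faster kernel.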

In terms of gaming, when GPGPU began entering consumer consciousness with the R500 series, GPGPU physics seemed to be the next big thing. Now that highly programmable GPUs are commonplace and the APIs have caught up, mainstream GPGPU physics is nowhere to be found. It seems the common current use cases for GPGPU in games are decompressing textures and doing ambient occlusion. What happened to GPGPU physics? Did developers determine that since multi-core CPUs are generally underutilized in games, there is plenty of room to expand physics on the CPU without having to bother with the GPU? Is GPGPU physics coming eventually? I could see concerns about contention between running physics and graphics on the same GPU, but given most CPUs are coming integrated with a GPGPU IGP anyways, the ideal configuration would be a multi-core CPU for game logic, an IGP as a physics accelerator, and a discrete GPU for graphics.

GPUs can run physics just fine. The problem with GPU physics is scaling. Unlike graphics, which easily scales across a wide range of hardware capabilities (eg: by changing resolution, using antialiasing, and changing texture resolution), it is very difficult to scale simulation compute requirements. Game programmers and artists must do a lot of extra work to take advantage of increased simulation capability, which they are reluctant to do, since they are usually happy to target lowest common denominator (consoles). This will continue to be the case until tools and runtimes are available which allow artists to create scalable physics content with little to no additional effort. HSA is an ideal architecture for running physics.

Latency and overhead by GullLars

Will the GPGPU acceleration mainly improve embarrassingly parallel and compute bandwidth constrained applications, or will it also be able to accelerate smaller pieces of work that are parallel to a significant degree?

Until now, only workloads with a significant data parallel component could benefit from heterogeneous compute. With HSA APUs, however, communication between the GPU and CPU no longer requires unnecessary copies, cache flushes are no longer invoked automatically, and the optimized runtime and driver stacks greatly reduce dispatch latency, so the type and number of workloads that benefit from heterogeneous compute are greatly increased.
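Why lower dispatch latency widens the set of workloads can be sketched as a break-even calculation. This is a simplified model with hypothetical inputs: it assumes a fixed dispatch latency per offload, a constant CPU cost per item, and a constant GPU speedup, none of which are figures from AMD.

```python
import math


def min_profitable_items(dispatch_latency_s: float,
                         cpu_seconds_per_item: float,
                         speedup: float) -> int:
    """Smallest item count n for which offloading is strictly faster.

    Offloading wins when
        n * cpu_seconds_per_item > dispatch_latency_s + n * cpu_seconds_per_item / speedup
    """
    if speedup <= 1:
        raise ValueError("offloading never wins without a speedup above 1")
    saved_per_item = cpu_seconds_per_item * (1 - 1 / speedup)
    return math.floor(dispatch_latency_s / saved_per_item) + 1


# Hypothetical inputs: 1 microsecond of CPU work per item and a 5x GPU speedup.
high_latency_threshold = min_profitable_items(100e-6, 1e-6, 5.0)
low_latency_threshold = min_profitable_items(10e-6, 1e-6, 5.0)

print(high_latency_threshold, low_latency_threshold)
```

In this model the break-even work size scales linearly with dispatch latency, so cutting latency by 10x makes roughly 10x smaller pieces of parallel work worth offloading.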

And what is the latency associated with branching off and running a piece of code on the parallel part of the APU? (f.ex. as a method called by a program to work on a large set of independent data in parallel)

Different on different products

Change starts with you by Tanclearas

Although I do agree that there are many opportunities for HSA, I am concerned that AMD's own efforts in using heterogeneous computing have been half-baked. The AMD Video Converter has a smattering of conversion profiles, lacks any user-customizable options (besides a generic "quality" slider), and hasn't seen any update to the profiles in a ridiculously long time (unless there were changes/additions within the last few months).

AMD recognizes that heterogeneous compute requires specific and new measures to ease developer adoption. To this end AMD is adopting the strategy of delivering domain-specific SDKs and providing optimized sample applications. These serve as reference code to ease the developer's job of extracting performance especially for targeted and common use cases. APP SDK is an example - stay tuned for more

It is no secret that Intel has put considerable effort into compiler optimizations that required very little effort on the part of developers to take advantage of. AMD's approach to heterogeneous computing appears simply to wait for developers to do all the heavy lifting.

The question therefore is, when is AMD going to show real initiative with development, and truly enable developers to easily take advantage of HSA? If this is already happening, please provide concrete examples of such. (Note that a 3-day conference that also invites investors is hardly a long-term, on-going commitment to improvement in this area.)

Just to clarify, HSA is not available today. We outlined our roadmap for the future of APUs last year at AFDS, which included the evolution of HSA. Most of the HSA features will be available on our 2013 and 2014 platforms. We are going to announce the schedule for availability of our HSA software stack, our tools and the library plan at AFDS. AFDS is a continued forum where we will bring together software developers to interact with us and our partners to let them know the direction of our platforms in the future. The fact that investors attend does not detract from the fact that it is targeted primarily at software developers. The overwhelming majority of presentations and talks are directed at software developers. Several key partners will be delivering keynotes at AFDS expressing their aligned view of heterogeneous computing, including technical leaders from Adobe, Cloudera, Penguin Computing, Gaikai and SRS.

We have just announced the growing range of software vendors that support OpenCL on our platforms today. These include companies such as SONY, Adobe, Arcsoft, WinZip, Cyberlink, Corel, Roxio, and many, many others. We are confident all of them will be enthusiastic about supporting HSA.

In addition see the answer to the above question and what we are doing wrt making OpenCL easier to use.

Two questions by markstock

Mr. Hegde, I have two questions which I hope you will answer.

To your knowledge, what are the major impediments preventing developers from thinking about this new hierarchy of computation and beginning to program for heterogeneous architectures?

See my answer to the first question where I list the hardware features of HSA and the issues they solve. Those are all issues with today's heterogeneous compute models.

AMD clearly aims to fill a void for silicon with tightly-coupled CPU-like and GPU-like computational elements, but are they only targeting the consumer market, or will future hardware be designed to also appeal to HPC users?

Absolutely. We will be bringing HSA based APUs to the market in the near future and all the aspects of ease of programming and much greater performance per joule that HSA brings to the market will greatly benefit the HPC space. In fact, Penguin Computing, is already implementing APUs in HPC server designs and will be sharing details on HPC heterogeneous compute at AFDS during their keynote.

When will the software catch up? by Loki726

AMD Fellow Mike Mantor has a nice statement that I believe captures the core difference between GPU and CPU design.

"CPUs are fast because they include hardware that automatically discovers and exploits parallelism (ILP) in sequential programs, and this works well as long as the degree of parallelism is modest. When you start replicating cores to exploit highly parallel programs, this hardware becomes redundant and inefficient; it burns power and area rediscovering parallelism that the programmer explicitly exposed. GPUs are fast because they spend the least possible area and energy on executing instructions, and run thousands of instructions in parallel."

Notice that nothing in here prevents a high degree of interoperability between GPU and CPU cores.

When will we see software stacks catch up with heterogeneous hardware? When can we target GPU cores with standard languages (C/C++/Objective-C/Java), compilers(LLVM, GCC, MSVS), and operating systems (Linux/Windows)? The fact that ATI picked a different ISA for their GPUs than x86 is not an excuse; take a page out of ARM's book and start porting compiler backends.

AMD is addressing this via HSA. HSA addresses these fundamental points by introducing an intermediate layer (HSAIL) that insulates software stacks from the individual ISAs. This is a fundamental enabler to the convergence of SW stacks on top of heterogeneous compute (HC).

Unless the install base is large enough, the investment to port *all* standard languages across to an ISA is forbiddingly large. Individual companies like AMD are motivated but can only target a few languages at a time. And the software community is not motivated if the install base is fragmented. HSA breaks this deadlock by providing a "virtual ISA" in the form of HSAIL that unifies the view of HW platforms for SW developers. It is important to note that this is not just about functionality but preserves performance sufficiently to make the SW stack truly portable across HSA platforms
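The deadlock described above comes down to simple counting. The sketch below compares how many translators are needed with and without a shared intermediate layer. The language list is taken from the question, the ISA names are hypothetical placeholders, and the count ignores the fact that individual ports differ in effort.

```python
def direct_ports(languages: list[str], isas: list[str]) -> int:
    """Every language needs its own backend for every ISA."""
    return len(languages) * len(isas)


def ports_via_virtual_isa(languages: list[str], isas: list[str]) -> int:
    """Each language targets the virtual ISA once, and each hardware ISA
    needs one translator from the virtual ISA."""
    return len(languages) + len(isas)


LANGUAGES = ["C", "C++", "Objective-C", "Java"]    # from the question above
HYPOTHETICAL_ISAS = ["isa_a", "isa_b", "isa_c"]    # placeholder names

without_layer = direct_ports(LANGUAGES, HYPOTHETICAL_ISAS)
with_layer = ports_via_virtual_isa(LANGUAGES, HYPOTHETICAL_ISAS)

print(f"direct ports: {without_layer}, via a virtual ISA: {with_layer}")
```

The gap widens as either list grows, since the direct approach scales with the product of languages and ISAs while the layered approach scales with their sum.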

Why do we need new languages for programming GPUs that inherit the limitations of graphics shading languages? Why not toss OpenCL and DirectX compute, compile C/C++ programs, and launch kernels with a library call? You are crippling high level languages like C++-AMP, Python, and Matlab (not to mention applications) with a laundry list of pointless limitations.

AMD sees OpenCL as a critical and necessary step in the evolution of programming. Single-core programming evolved from assembly to C++ and Java: it started with very few expert programmers doing close-to-metal coding, moved to a larger number of trained professionals driving products, and finally became easy enough for the minimally trained programming masses to target CPUs. Symmetric multi-core programming followed a similar trend, from pthreads to models like OpenMP and TBB.

Today, pioneered by experts who managed to write compute code within shaders, heterogeneous compute now has its first standard programming model in OpenCL. AMD introduced Aparapi that provides Java developers an easy way to access GPU compute. C++ AMP is the first instance of the natural next step in this evolution, i.e. extensions of existing programming models to target GPU compute and thus bringing in the (large) community adoption. AMD will strongly support this expansion into languages like Fortran, Python, Ruby, R, Matlab…

In addition, domain-specific libraries are also being targeted, e.g. OpenCV, x264, crypto++, to allow the programmer to focus on the job at hand, instead of the mechanics of obtaining performance. This is the fastest way to enable existing application code bases to leverage heterogeneous compute.

And of course, HSA is a key enabler of this next step since it expands the install base for SW developers to target via the portable performance it enables across various ISAs.

However, just as assembly optimizations remain in use, AMD sees OpenCL continuing to coexist with high-level programming so that performance-critical developers can extract the most out of a particular platform.

Where's separable compilation? Why do you have multiple address spaces? Where is memory mapped IO? Why not support arbitrary control flow? Why are scratchpads not virtualized? Why can't SW change memory mappings? Why are thread schedulers not fair? Why can't SW interrupt running threads? The industry solved these problems in the 80s. Read about how they did it, you might be surprised that the exact same solutions apply.

- OpenCL 1.2 (supported by the upcoming AMD APP SDK 2.7) supports clCompileProgram and clLinkProgram.
- HSA MMU enables a shared address space between CPU and GPU
- HSAIL supports more flexible control flow.
- SI-based GPUs include high-performance read/write caches which effectively can be virtualized.
- Future AMD APUs will support HW context switching, including ability for SW to interrupt running threads