madwolfa - Monday, August 24, 2015 - link
"Today at Hot Chips..." ... I smell the irony.sandy105 - Monday, August 24, 2015 - link
The only first comment you'll ever need to read !! lol
ddriver - Monday, August 24, 2015 - link
Hmm... the DSP is bigger than the CPU in terms of die area...
ddriver - Monday, August 24, 2015 - link
On second glance... it seems the primary motivation is to promote their DSP architecture. I mean, the GPU is most likely OpenCL 2 compliant and easily offers 10 times the CPU's compute performance at the same power. Oh, and it has FP, both 32- and 64-bit, which is a must for professional image, video or sound processing. The DSP is special purpose and of very limited use, complicated and not readily available for programming by the user. That die area could be better invested in a bigger GPU, which would deliver better graphics performance and could easily handle all the tasks that would typically be handled by the DSP.
saratoga4 - Monday, August 24, 2015 - link
This isn't really the right way to think about it. DSPs and GPUs are related in that they are both heavily application-optimized general purpose processors, but they are optimized in very, very different ways. A GPU is massively thread-parallel, high latency, and very good at floating point multiply-add instructions, while very, very bad at branches. A DSP is usually fixed-point (or sometimes floating-point) optimized, very low latency, very fast at branches, and has very high single-threaded performance per watt. For tasks the GPU is good at, you really don't want to use the DSP. For tasks the DSP is good at, you really don't want to use the GPU.
ddriver - Monday, August 24, 2015 - link
GPU latency has historically been high because on a typical desktop system with a discrete GPU you have to transfer data back and forth over PCIe. On this chip the memory is shared between the GPU and the CPU, so all the latency penalty of having to transfer data is eliminated.

They put the focus on sensors and image processing. Neither is so demanding that the GPU latency on this chip would be an issue.
Naturally, there is no benefit to pushing sensor data to the GPU; that could be handled by a dedicated MCU, like Apple did. And while it might be a tad more efficient with a dedicated DSP, I doubt the difference will be tangible at the device scale.
As for image processing, I reckon the GPU is a much better fit; it has more power and more features. And it can run OpenCL, which means applications will be portable and run on every platform that supports OpenCL, whereas with the DSP you must explicitly target it, and that code will do you no good on other platforms.
All in all, as I said, it seems like Qualcomm is pushing a proprietary technology in the hope of locking in a developer and, subsequently, user base. That's bad. There are better ways to invest that chip die area.
An MCU for the sensors and a GPU for image processing is simpler, more flexible and more portable. It is a full and feature-rich chain to power the device: the MCU can run a variety of tasks, even a simple OS, making it possible to suspend the main CPU completely; the CPU ALUs offer low latency but poor throughput; the CPU SIMD units medium latency and medium throughput; and the GPU "high" latency but the highest throughput.
saratoga4 - Monday, August 24, 2015 - link
>As for image processing, I reckon the GPU is a much better fit,

Pretty sure you've never programmed a GPU or DSP then :)
Problems with a GPU for these applications: very poor power efficiency, lack of good fixed point hardware support (modern GPUs are floating point), and extremely poor handling of branches. You can do some simple image processing applications efficiently on a GPU (e.g. filtering is very natural) but more complex operations are very hard to implement. You really don't want to use a GPU for this stuff. It makes even less sense than the CPU for a lot of things.
ddriver - Monday, August 24, 2015 - link
"lack of good fixed point hardware support (modern GPUs are floating point)" - shows what you know. GPUs have no problem with integers whatsoever. Image processing is a parallel workload, there is little to no branching involved. And it is not just filtering, but also blending, noise reduction, a wide variety of image effects - blur, sharpen, edge detection, shadows, transformations - you name it... image processing benefits from GPU compute tremendously. I think you are CONFUSING image processing with computer vision, which are completely different areas. CV can still benefit a lot from GPU compute, but certainly not as much as image processing.OpenCL 2 massively improves the versatility of GPU compute. As someone, who does image, audio and video processing for a living, I am extremely excited about it becoming widely adopted. I've had to pay mad bucks in the past for "special purpose hardware" with DSPs for real time or accelerated media processing, but today's GPUs complete destroy those not only in terms of price performance ratio, but also peak performance.
saratoga4 - Monday, August 24, 2015 - link
>"lack of good fixed point hardware support (modern GPUs are floating point)" - shows what you know. GPUs have no problem with integers whatsoever.Fixed point is not the same as integer. No offense, but you shouldn't be arguing about this if you don't know what the words mean ;)
name99 - Monday, August 24, 2015 - link
One day you're going to look back at this comment and wish you'd never said it...

How do you imagine fixed point differs materially from integer? The way everyone handles fixed point is to imagine a virtual binary point, which is usually effected by multiplying by pre-shifted coefficients. Then at the end of the process you shift right or something similar to round to integer.
Look at something like http://halicery.com/jpeg/idct.html for an example.
The only way a device might handle fixed point differently is if it automatically downshifted appropriately when multiplying together two fixed point numbers. In principle Hexagon could do this, but I'm unaware of any devices that do, beyond trivial cases like a multiply-high instruction that takes 16.16 input and returns the high 32 bits.
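As a concrete sketch of the "virtual binary point" idea (generic Q16.16 arithmetic in plain C, not Hexagon-specific code; the names are made up):

```c
#include <stdint.h>

/* Q16.16 fixed point: the low 16 bits are the fractional part. */
typedef int32_t q16_16;

#define Q16_ONE   (1 << 16)
#define TO_Q16(x) ((q16_16)((x) * Q16_ONE))   /* e.g. 0.7071 becomes a pre-shifted coefficient */

/* Multiply in 64-bit, then shift right to put the binary point back. */
static q16_16 q16_mul(q16_16 a, q16_16 b)
{
    return (q16_16)(((int64_t)a * b) >> 16);
}

/* At the end of the pipeline: round a non-negative value to integer. */
static int32_t q16_to_int(q16_16 a)
{
    return (a + Q16_ONE / 2) >> 16;
}
```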
name99 - Monday, August 24, 2015 - link
What you're saying is substantially silly. Apple provides a large library of different types of image processing operations (including user-provided filters) in Core Image, and these are implemented on the GPU on both OSX and iOS, apparently very fast and at low power.

Apple may well at some point put a DSP onto the A-series chips (especially once the QC-supplied LLVM framework I described above is mature, and perhaps also once Swift is a little more mature), but there doesn't seem to have been a compelling argument so far, since the A4 at least.
There's also the issue that, while the argument I gave re memory access regularity is correct, insisting on a fixed-point-only path is problematic. While a few algorithms map easily onto a fixed-point pathway (basically linear filters), as soon as you include any sort of non-linear processing you have to spend a whole lot of time constantly normalizing. If Apple were to include a DSP for upcoming areas of interest (like perceptrons for machine learning), they'd probably make it FP all the way (maybe 16-bit FP, but still FP) rather than bother with a fixed point setup.
MrSpadge - Monday, August 24, 2015 - link
He was not talking about the PCIe latency, but about the instruction latency, i.e. the time it takes for instructions to complete. It's huge on GPUs. That's how they achieve their efficiency for massively parallel tasks. But it hurts them badly for any other tasks (branches).

And "it has more power" means nothing in the mobile world if the alternative has "enough" computing power and completes the task for (significantly) less energy. Also, being flexible is nice, but it's no real advantage if the alternative can do the job well.
saratoga4 - Monday, August 24, 2015 - link
^^^ Exactly. Latency on GPUs is very high, and that has nothing to do with PCIe.

It's fine that you can take a 200 watt desktop GPU and run basic tasks like filtering with good performance, but you're still using a 200 watt GPU to do something that a DSP could do at orders of magnitude lower power consumption, which is one reason why GPUs are almost never used for image processing on mobile devices. You'll kill the battery. The other main reason is the inability of GPUs to handle heavily branching, non-parallel image processing tasks.
ddriver - Monday, August 24, 2015 - link
What do you mean by "very high". Microseconds, nanoseconds, milliseconds? Is it too high to be useful? I've gotten GPU compute to run on a millisecond resolution, and it could probably go even lower, now do tell, for what kind of tasks is ONE THOUSANDTH of a second too slow?I've seen such arguments before, from people who "heard that GPU compute is unusable for audio processing because latency is too high" - few weeks later I had an audio engine processing offline audio at 20 times the rate, at which the CPU was capable of, and I haven't heard that argument since then. And that wasn't the best part, the CPU was pretty much sitting idle, giving it much more clocks to handle real time audio at a lower latency.
Image processing is by its nature a parallel task. But please, do tell several more times how GPUs can't handle branching, because I feel the several times it has already been mentioned are really not enough.
The figures for that DSP put it on average at twice the CPU's speed at an eighth of the power. Given that the GPU can outperform the CPU in parallel tasks at least 10 times, EASY, at a comparable power envelope, I'd say it is about as efficient as the DSP, with the advantages of more throughput, more features, and better portability when it comes to programming it.
Last but not least, with OpenCL 2's SVM, believe it or not, GPUs massively outperform CPUs even for stuff like binary tree search, which is by its nature a non-parallel workload. So there is that too...
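For context on what that workload looks like, here is a scalar pointer-chasing lookup in plain C (the node layout is purely illustrative). The relevance of OpenCL 2's shared virtual memory is that the GPU can follow the very same host pointers, so thousands of such lookups can be issued as one kernel launch, one work-item per query, instead of first marshalling the tree into a GPU-friendly format:

```c
#include <stddef.h>

/* Plain pointer-based binary search tree lookup (scalar reference).
   With shared virtual memory the left/right pointers stay valid on the
   GPU, so many independent queries can run this loop in parallel. */
struct node {
    int key;
    struct node *left, *right;
};

const struct node *bst_find(const struct node *root, int key)
{
    while (root != NULL && root->key != key)
        root = (key < root->key) ? root->left : root->right;
    return root;   /* NULL if the key is not present */
}
```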
saratoga4 - Monday, August 24, 2015 - link
>Image processing is by its nature a parallel task.

haha yeah sure, all image processing algorithms are parallel by their nature. If you really believe that, I think we're unlikely to be able to discuss much of anything. Congrats on learning OpenCL though. I think you're letting it go to your head a bit.
name99 - Monday, August 24, 2015 - link
"which is one reason why GPUs are almost never used for image processing on mobile devices"I just told you GPUs ARE used for lotsa image processing on iOS.
The reason they aren't used on Android probably has to do with lack of a decent framework, and no standard hardware model for Google to target.
name99 - Monday, August 24, 2015 - link
"whereas with the DSP you must explicitly target it, and that code will do you no good on other platforms."This is not exactly true. QC have contributed to LLVM a framework (currently in flux) for designating different pieces of code to run on different ISAs, along with the tools (bundling and linking) necessary for this to happen. The vision is that you write your app in a more-or-less constrained version of C (and presumably soon C++, maybe even Swift), annotate certain functions and memory blocks by their target ISA, and everything else just happens automatically.
Like everything it will take time to fully fall into place, but there is a scheme there that has general potential for anyone interested in accelerators, whether GPUs, DSPs, or more specialized devices.
As for GPU vs DSP, I'd say the difference is in how much irregularity the device expects. GPUs have to expect a lot of memory irregularity, and thus provide a lot of transistors to cope with that fact. DSPs target much more regular memory access, and don't have to spend those transistors and their related power costs. Of course this assumes all things being equal, and if 10x the money and resources are spent on GPU design, optimization, and compilers, then things will not be equal...
extide - Monday, August 24, 2015 - link
The DSP is actually TINY in die area, probably smaller than a single A53 core, for example. The picture in this article is just a logical representation, NOT an accurate floorplan at all.
extide - Monday, August 24, 2015 - link
EDIT: Sorry, I meant smaller than a single A57 -- not A53 :)
ddriver - Monday, August 24, 2015 - link
I doubt that, as they claim it is faster than a quad core CPU's SIMD units. As efficient as the DSP may be, it is neither magic nor a miracle; this performance cannot come from nowhere. IIRC the DSP's benefit is that its instructions do a lot more work, but it still takes transistors to do all that work, regardless of how many instructions it takes.
saratoga4 - Monday, August 24, 2015 - link
DSP cores are incredibly small. This is probably a lot smaller than an A57, maybe even smaller than an A53. It's not magic; it's just not a general purpose processor, so it saves a huge amount of die area.
saratoga4 - Monday, August 24, 2015 - link
Those boxes definitely aren't to scale. The CPU and DSP are actually much smaller, the GPU is much bigger.
ddriver - Monday, August 24, 2015 - link
Yeah, it actually says that in fine print.
name99 - Monday, August 24, 2015 - link
That QC die photo is pure fantasy. It is a bunch of rectangles slapped on a background of a die. The rectangles not only don't represent the size of the components, they don't even represent their shape or position.
xdrol - Monday, August 24, 2015 - link
If you read the first 4 paragraphs and mentally replace the word DSP with GPU, you get an equally correct text, so it seems that the two are more or less the same, just tuned for a somewhat different use case.

I think it would be more fitting to compare the DSP's architecture to a GPU arch than a CPU arch; GPUs are also often in-order, have a VLIW ISA, have multiple compute units, and have special instructions to accelerate key algorithms.
DanNeely - Monday, August 24, 2015 - link
While that was true of older GPU architectures, modern GPUs are much closer to CPUs in flexibility/capability than they are to DSPs.
xdrol - Monday, August 24, 2015 - link
Apart from AMD's GCN, all other GPUs are still in-order and use a VLIW ISA, including Intel Gen9, nVidia Kepler 2.0, Qualcomm Adreno 5xx or ARM Midgard (not sure about Imagination 7XT, but I take an educated guess of "that too"). The "multiple unit" and "special instruction" parts are true for GCN also.
Drumsticks - Monday, August 24, 2015 - link
Did you mean Maxwell 2.0? Kepler definitely wouldn't be Nvidia's most modern architecture.
saratoga4 - Monday, August 24, 2015 - link
>If you read the first 4 paragraphs and mentally replace the word DSP with GPU, you get an equally correct text, so it seems that the two are more or less the same, just tuned for a somewhat different use case.

Apart from being specialized processors, they actually don't have much in common. "Not an OOOE CPU" encompasses a huge number of different types of processors, many of which are very different.
>I think it would be more fitting to compare the DSP's architecture to a GPU arch than a CPU arch; GPUs are also often in-order, have a VLIW ISA, have multiple compute units, and have special instructions to accelerate key algorithms.
Aside from VLIW, all of those things would also be true of an A53, for instance. The main difference between a DSP and a GPU is that a GPU is designed for very high throughput across an enormous number of threads, whereas a DSP is designed around single-threaded throughput, more like a CPU.
saratoga4 - Monday, August 24, 2015 - link
Actually no, Hexagon is much, much closer to a CPU than any modern GPU. You can boot Linux on a Hexagon DSP with no CPU at all, and Linux application performance is actually quite good. Not as good as Krait, but as a general purpose CPU, Hexagon is surprisingly performant. Compared to a GPU, it's orders of magnitude more capable.
name99 - Monday, August 24, 2015 - link
"I think it would be more fit to compare the DSP's architecture to a GPU arch than a CPU arch; GPUs are also often in-order, have a VLIW ISA, have multiple compute units, "Both DSPs and GPUs are optimized to exploit correlated computation (ie doing basically the same thing on nearby data over and over). This is where they both differ from CPUs which are happy to exploit correlation when they find it, but which are optimized for algorithms and data with very little correlation.
Where DSPs and GPUs differ:
- theoretically: GPUs are optimized to handle irregular memory patterns, DSPs to handle regular memory patterns
- practically (i.e. this is a historical fact that could change, but hasn't yet): GPUs provide float support while DSPs generally do not
Wardrive86 - Friday, July 6, 2018 - link
Qualcomm added floating point support to Hexagon in 2013 (qdsp6v5); that's all Snapdragon 800 processors, the 400 series since the 410, and all 600 series since the 615. qdsp6v6 covers the 820+, 660+, etc. All modern chipsets with Hexagon have robust floating point support.
nismotigerwvu - Monday, August 24, 2015 - link
Nice writeup Joshua, but there's a typo in the 3rd paragraph. It currently says "These deisgn goals..." so it looks like your right hand is running a little faster than your left.
jjj - Monday, August 24, 2015 - link
Those numbers are not ideal. Comparing to the GPU (coupled with A53 cores) would be more fitting, since that's what others can do even in budget parts. The sensor hub would be a better fit in watches, but they don't have any SoCs for that, and it remains to be seen how it would do against a discrete hub targeted at that market.
MrSpadge - Monday, August 24, 2015 - link
MrSpadge - Monday, August 24, 2015 - link
Which numbers are "ideal" in chip design? change the application profile a bit and you'll get a different answer.Brazos - Monday, August 24, 2015 - link
Will this solve -"Android’s 10 Millisecond Problem: The Android Audio Path Latency "
http://superpowered.com/androidaudiopathlatency/#a...
saratoga4 - Monday, August 24, 2015 - link
No, this is totally unrelated. Audio latency is entirely about buffer sizes, with larger sizes being more power efficient but higher latency. At an API level, Android has made the deliberate decision not to offer very-low-latency audio support. The hardware is basically irrelevant.
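As a rough sense of scale (the buffer sizes below are arbitrary examples, not Android's actual settings), per-buffer latency is simply frames per buffer divided by the sample rate, and a real output path usually has two or three such buffers in flight:

```c
#include <stdio.h>

/* Illustrates why audio latency is a buffer-size decision. */
int main(void)
{
    const int sample_rate = 48000;                 /* Hz */
    const int frames_per_buffer[] = { 96, 256, 1024 };

    for (int i = 0; i < 3; i++) {
        double ms = 1000.0 * frames_per_buffer[i] / sample_rate;
        printf("%4d-frame buffer -> %.1f ms per buffer\n",
               frames_per_buffer[i], ms);
    }
    return 0;
}
```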
Magic999 - Monday, August 24, 2015 - link
HVX is a similar concept to NEON, but has 128-byte vectors (usable in pairs), which is up to 16 times wider than NEON's typical 128 bits. Since HVX is tied to the DSP core, it can avoid heavy ARM activity and DDR bus traffic that would otherwise push the power rails into high-performance states, and power is the critical concern in mobile. The GPU is good for general filters without many conditional loops, but it generally costs a lot of power and requires more ARM engagement. HVX/DSP can work without ARM engagement, which is one of the benefits. HVX looks like a good approach to meeting mobile power and performance requirements.
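To put the vector widths in perspective, here is a minimal NEON sketch (the function name and the choice of a saturating add are just for illustration); NEON processes 16 8-bit pixels per instruction, while a single HVX vector instruction covers 128 bytes, so the same loop structure does several times more work per operation:

```c
#include <arm_neon.h>
#include <stdint.h>

/* Adds two rows of 8-bit pixels, 16 lanes at a time, using 128-bit NEON. */
void add_rows_neon(const uint8_t *a, const uint8_t *b, uint8_t *out, int n)
{
    int i = 0;
    for (; i + 16 <= n; i += 16) {
        uint8x16_t va = vld1q_u8(a + i);
        uint8x16_t vb = vld1q_u8(b + i);
        vst1q_u8(out + i, vqaddq_u8(va, vb));   /* saturating add, 16 lanes */
    }
    for (; i < n; i++) {                        /* scalar tail */
        int s = a[i] + b[i];
        out[i] = (uint8_t)(s > 255 ? 255 : s);
    }
}
```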
name99 - Monday, August 24, 2015 - link
Are you going to attend and report on the Mars session this afternoon?

We are all interested in how many-core ARMv8 is about to play out...
lilmoe - Tuesday, August 25, 2015 - link
"It remains to be seen whether OEMs using Snapdragon 820 will use these DSPs to the fullest extent"................................................
Wardrive86 - Friday, July 6, 2018 - link
Important to note that the Hexagon 600 series (qdsp6v6) can do floating point, in fact up to 8 FLOPs per clock, but there is no HVX floating point support.