Original Link: https://www.anandtech.com/show/8933/snapdragon-810-performance-preview
Understanding Qualcomm's Snapdragon 810: Performance Preview
by Joshua Ho & Andrei Frumusanu on February 12, 2015 9:00 AM EST
While we can dance around the issue, it’s impossible to have any real discussion about Snapdragon 810 without addressing the flurry of rumors that have surrounded this SoC. There have been rumors of overheating, delays, and all sorts of defects. In light of this, the Snapdragon 810 and its performance have been the subject of intense interest. In order to learn more, we recently met with Qualcomm to do a deep dive on the Snapdragon 810 and properly benchmark it for comparison against other SoCs.
While those that have followed the SoC market closely are likely to be quite familiar with the Snapdragon 810, it’s still worth going over the basics of the SoC before diving into aspects such as performance. In general, the area of greatest focus and one of the most important aspects of any SoC is the application processor. In the case of the Snapdragon 810, Qualcomm has licensed ARM’s Cortex A57 and A53 architectures for the CPU, which we’ve previously discussed in depth in our review of the Galaxy Note 4 Exynos. The Snapdragon 810 comes with the A57 cluster clocked at 1958 MHz and the A53 cluster at 1555 MHz.
Qualcomm Snapdragon 810 Specifications
SoC | Snapdragon 810 | Snapdragon 805 | Samsung Exynos 5433
CPU | 4x Cortex A53 @ 1.555GHz, 4x Cortex A57 r1p1 @ 1.958GHz, 2MB L2 cache | 4x Krait [email protected], 4x512KB L2 cache | 4x Cortex A53 [email protected], 512KB L2 cache; 4x Cortex A57 r1p0 @ 1.9GHz, 2MB L2 cache
Memory Controller | 2x 32-bit @ 1555MHz LPDDR4, 24.8GB/s b/w | 4x 32-bit @ 800MHz LPDDR3, 25.6GB/s b/w | 2x 32-bit @ 825MHz LPDDR3, 13.2GB/s b/w
GPU | Adreno 430 @ 600MHz | Adreno 420 @ 600MHz | Mali T760MP6 @ 700MHz
Mfc. Process | TSMC 20nm SoC | TSMC 28nm HPm | Samsung 20nm HKMG
For the most part, Qualcomm seems to have adopted a relatively similar approach by using a 4+4 big.LITTLE design, which means that four Cortex A57s serve as the “high power” cores, and four Cortex A53s work as the “low power” cores, with a CCI-400 to allow for cache coherency between the two clusters. However, while the architecture is licensed from ARM, the actual implementation of the logic has been optimized by Qualcomm to improve performance and/or power consumption. Like most recent big.LITTLE SoCs, Qualcomm’s Snapdragon 810 has all eight cores exposed to applications, and relies upon task scheduling mechanisms to decide how to place threads on each core. However, Qualcomm, unlike all other licensees of big.LITTLE, has decided to stray away from ARM's and Linaro's software implementation, and we'll be scratching the surface of what this means in terms of power and performance on the Snapdragon 810.
Outside of the CPU, Qualcomm has integrated an Adreno 430 GPU, which is said to deliver a performance improvement of 30%, possibly more for a shader-heavy workload. Beyond this, the Adreno GPUs continue to be a black box in terms of technical detail. Qualcomm states that this GPU wasn’t a straight extension of the Adreno 420, which suggests that there have been architectural changes to the GPU, although we weren’t told what they were. We should be seeing final clocks running at 600MHz, meaning the GPU is running at the same frequency as the Adreno 420 in Snapdragon 805 devices.
To feed these components and the rest of the SoC, Qualcomm has fitted the SoC with a dual-channel 32-bit (total 64-bit) wide LPDDR4-1555 memory interface, which means a peak of 24.9 GB/s in memory bandwidth and basically maintains parity with the Snapdragon 805 despite the reduced bus width. The move to LPDDR4 should also bring a reduction in power consumption of the memory interface of up to 20% when compared to LPDDR3.
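As a quick sanity check on these figures, peak bandwidth is simply the bus width multiplied by the effective transfer rate. The sketch below assumes a double data rate interface (two transfers per clock); the function name and the dictionary of configurations are our own illustration, built from the numbers in the specification table.

```python
def peak_bandwidth_gb_s(channels: int, bits_per_channel: int, clock_mhz: float) -> float:
    """Peak DRAM bandwidth in GB/s, assuming two transfers per clock (DDR)."""
    bytes_per_transfer = channels * bits_per_channel / 8
    transfers_per_second = clock_mhz * 1e6 * 2  # double data rate
    return bytes_per_transfer * transfers_per_second / 1e9


# (channels, bits per channel, clock in MHz), taken from the specification table
configs = {
    "Snapdragon 810 (LPDDR4)": (2, 32, 1555),
    "Snapdragon 805 (LPDDR3)": (4, 32, 800),
    "Exynos 5433 (LPDDR3)": (2, 32, 825),
}

for name, (channels, width, clock) in configs.items():
    print(f"{name}: {peak_bandwidth_gb_s(channels, width, clock):.2f} GB/s")
```

The Snapdragon 810 works out to 24.88 GB/s against 25.6 GB/s for the Snapdragon 805, which is why halving the bus width costs almost nothing in peak bandwidth.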
Outside of the GPU and CPU, Qualcomm has focused on iterating all other aspects of this SoC. The Snapdragon 810 represents Qualcomm’s first high-end SoC with an HEVC encoder, which is said to support up to 4K30 although we were not told the exact bit rate limits or any other encoder settings. Along the same lines, Qualcomm has upgraded the ISP in the Snapdragon 810 to a “14-bit” dual ISP, which allows for features such as multiple cameras for depth mapping or other computational photography features. This new ISP in the Snapdragon 810 can process 1.2 GP/s, in contrast with the 1 GP/s of the Snapdragon 805’s ISP. As mentioned in previous articles, this ISP is clocked at 600 MHz. The audio codec is the WCD9330 which is carried over from the Snapdragon 805.
On the RF side of things, Qualcomm is introducing a new category 9 modem that is built into the Snapdragon 810. In our experience, an integrated modem does improve battery life, although in practice these benefits will likely be difficult to distinguish from a multitude of other factors on total battery life. While one might guess that this is similar to the MDM9x45 external modem, it seems that there may not be enough bandwidth to support both upload and 3x download carrier aggregation. We also see a new suite of RF360 parts to accompany the Snapdragon 810, which include an antenna tuner, CMOS PA/antenna switch, and envelope tracker. Outside of the WTR3925 transceiver that was introduced with the Snapdragon 805, we see a new WTR3905 companion chip for 3x download CA and upload CA. In addition, we see a new variant of the QCA6174 WiFi chip, the QCA6174A, which enables MU-MIMO and a separate chip to enable 802.11ad.
RF
The basics can be all that's necessary to cover the RF changes in the Snapdragon 810 platform, but now is as good a time as any to really get down to the details of how this all works. For a while now, RF has been a black box. We’ve done some work on demystifying some aspects of RF, but there’s still quite a bit left to cover. While we have covered parts of RF systems like the envelope tracker, that’s only one piece of the puzzle. As a front-end solution, we’re still missing a great deal of nuance on the CMOS PA and integrated switch, along with the dynamic antenna tuner. In addition, there’s quite a bit in the pipeline that has come out since our last article on the state of Qualcomm’s RF components. WTR3925 and MDM9x35 have been shipping in mobile devices for a while, and we’re on the cusp of seeing new modems like Qualcomm’s MDM9x45 so there’s no better time to talk about all of this.
For those that are unfamiliar with how radio works at a high level it’s well worth going over in order to understand how everything comes together. At the basic level, on the receive path we can start at the antenna. The antenna is rather simple, and its goal is to convert radio waves into electrical energy. There’s definitely a lot more to this area, but for now that’s really all the knowledge that’s necessary. From there, the next step in the path is an antenna switch, which is used to select the right path for receive and transmit depending on the band used. A duplexer is the next step in the chain and is used to allow transmit and receive to be split into two separate parts. Before we get to the transceiver itself, filters (ideally) strip out any received signal that is out of the desired frequency band.
Once we get to the transceiver, a low noise amplifier takes the relatively weak signal from the filter and boosts it. After this, a down-converter converts the frequency of the signal to a baseband frequency by using a local oscillator that generates a signal that is mixed with the incoming signal. This is necessary because the signal is coming in at anywhere from 700 MHz to 5 GHz, which is almost impossible to process in real time with a relatively low-clocked DSP. In addition, this makes it easier to reject noise and due to the conversion in frequency it’s much easier to design an amplifier for this signal. That’s exactly what happens after this down-conversion. The signal is then split into the in-phase and quadrature components to make signal processing simpler. Another amplifier boosts the signal and then it reaches the baseband. At the baseband, an analog to digital converter processes the signal, and then the signal is demodulated. Once this is accomplished, the rest of the system simply sees the information as if it were packets of data in a format like TCP/IP.
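To make the down-conversion step a little more concrete, here is a minimal sketch of mixing a received tone with a local oscillator and splitting it into in-phase and quadrature components. It is a toy model rather than how any real transceiver is built: the carrier is scaled down to 1 kHz so it can be simulated in software, the low-pass filter is replaced by a plain average over whole carrier cycles, and all names and constants are our own.

```python
import math


def downconvert_iq(samples, carrier_hz, sample_rate_hz):
    """Mix a real signal with a local oscillator and average to recover I and Q."""
    i_sum = 0.0
    q_sum = 0.0
    for k, sample in enumerate(samples):
        theta = 2 * math.pi * carrier_hz * k / sample_rate_hz
        i_sum += 2 * sample * math.cos(theta)   # in-phase mixer
        q_sum += -2 * sample * math.sin(theta)  # quadrature mixer (LO shifted 90 degrees)
    # Averaging over whole carrier cycles removes the double-frequency mixing
    # product, standing in for the low-pass filter of a real receiver.
    n = len(samples)
    return i_sum / n, q_sum / n


SAMPLE_RATE_HZ = 48_000
CARRIER_HZ = 1_000      # stand-in for a carrier that is really 700 MHz to 5 GHz
AMPLITUDE = 0.5
PHASE_RAD = math.pi / 3

# 480 samples cover exactly ten carrier cycles at these rates.
received = [
    AMPLITUDE * math.cos(2 * math.pi * CARRIER_HZ * k / SAMPLE_RATE_HZ + PHASE_RAD)
    for k in range(480)
]

i, q = downconvert_iq(received, CARRIER_HZ, SAMPLE_RATE_HZ)
print("amplitude:", math.hypot(i, q))
print("phase (rad):", math.atan2(q, i))
```

After mixing, the amplitude and phase of the original signal can be read straight off the I and Q values at a frequency of zero, which is the sense in which the baseband no longer has to deal with the carrier at all.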
The path for transmission is similar, although there are a few modifications on that end. Starting from the baseband, the information is modulated into a specific format, then converted from a digital signal to an analog one as it leaves the baseband. From here, the signal travels through another set of amplifiers before it is combined and raised to the frequency needed for transmission in the up-converter. At this point, another driver amplifier is used to amplify the signal before it leaves the transceiver. There’s currently not much need to understand what the driver amplifier does other than to know that it exists, so don’t worry about that for now. What’s definitely important is the power amplifier. This is the point where the signal is driven from the relatively low levels in the transceiver and baseband to high enough power to contact a cell tower. After this is done, the signal goes through the duplexer, through an antenna switch, out to the antenna.
At a high level, that’s how things work. To break things down into the simplest form, there are two distinct sections: the RF front end and the baseband. The front end is designed to accurately capture as much information from the antennas as possible and filter it down to a form that the baseband can handle. The baseband is where all information is processed after the front end and receiver, and acts as the control center for the rest of the RF system. It’s definitely a lot to take in, but it will help a lot with understanding the relevance of RF360, WTR3925, and MDM9x35.
QFE2550 Antenna Tuner
One of the first areas to explore would be the RF front end, and before we jump into deep discussions we’ll tackle some of the simpler sections first. The QFE25xx antenna tuner is similar to the envelope tracker in the sense that it’s normally not part of the RF front end. I never actually mentioned an envelope tracker or antenna tuner at any step in the introduction. This is because these parts are not part of the basic superheterodyne radio system. However, like envelope tracking, the need for more battery life, faster data, and superior reception has driven the development of new technologies.
To understand how this antenna tuner works, we must introduce the concept of impedance matching and impedance. Impedance seems difficult at first, but is really just a form of resistance in an AC circuit for the purposes of understanding this article. The three components in a circuit that affect impedance are resistors, capacitors, and inductors. Impedance matching is exactly what it sounds like, which is equalizing the impedance at a junction. In transferring energy from an antenna to the RF front end, we must match the impedance between the antenna and front end. This is because if the impedances are mismatched, signal reflects back to the source. In other words, on the receiving side reception becomes weaker if impedance is mismatched, and on the transmission side energy is wasted. One can liken this to a glossy display, as while the vast majority of the light goes through the glass cover, some light is reflected back.
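The amount of signal lost to a mismatch can be estimated with the standard reflection coefficient, Γ = (Z_load − Z_source) / (Z_load + Z_source), where the fraction of power reflected is |Γ|². The sketch below assumes a purely resistive 50 ohm source, which is a common RF reference impedance but not a figure from Qualcomm, and the detuned antenna impedances are made up purely for illustration.

```python
def reflection_coefficient(z_load: complex, z_source: complex = 50 + 0j) -> complex:
    """Reflection coefficient at a junction with a resistive source impedance."""
    return (z_load - z_source) / (z_load + z_source)


def reflected_power_fraction(z_load: complex, z_source: complex = 50 + 0j) -> float:
    """Fraction of incident power that is reflected back toward the source."""
    return abs(reflection_coefficient(z_load, z_source)) ** 2


# Hypothetical antenna impedances: matched, mildly detuned, and heavily detuned
for z_antenna in (50 + 0j, 25 + 0j, 20 - 30j):
    lost = reflected_power_fraction(z_antenna)
    print(f"Z = {z_antenna}: {lost:.1%} of power reflected")
```

A perfectly matched antenna reflects nothing, while a load of half the source impedance already reflects about a ninth of the power.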
Of course, in the factory the RF system is carefully tuned to ensure that impedance mismatch is as low as possible. However, the real world makes for a difficult situation. The iPhone 4’s death grip behavior is a classic example of how real-world use can disrupt this impedance matching. By bridging two gaps in the metal ring of the iPhone 4, the antenna was detuned and its impedance was altered. As a result, signal became noticeably worse. Combined with the compressed signal representation, this made it possible for a “decent signal” to be completely lost by touching the right place on the phone.
This is where the QFE2550 antenna tuner and similar systems come in. The tuner acts as a voltage-controlled variable capacitor, and the baseband can be loaded with information that allows it to predict how much mismatch correction is needed for each detected scenario. Scenarios are detected with various sensors, which can include capacitive touch sensors, and each one is compensated for with pre-loaded corrections that are applied when a given scenario is detected based on frequency change or body loading. As a result, the efficiency of the antenna is restored as seen in the photo above, although the presence of body loading will inevitably reduce peak power and sensitivity. Such a system can be combined with more information, such as capacitive touch sensors or reflected power measurements, to ensure maximum responsiveness and performance. This has the greatest benefit in enabling phones with all-metal unibodies, although other tools such as antenna switched diversity and MIMO can be used to ensure peak reception.
QFE33xx CMOS PA/Switch
Normally, power amplifiers are not worth talking about. However, with the introduction of RF360 Qualcomm has bandied about the concept of a CMOS PA to cover a wide swath of bands rather than just a few. While this sounds interesting, there’s a great deal more nuance to this issue than simply band coverage. First, semiconductor/solid state physics is required to really understand where the debate lies. Of course, there’s really no time to go over this in major depth, but there are a few basics. The transistor is a switch that controls the flow of current. However, there is a limit to how much current can be carried, and there are a few regions in which the relationship between voltage and current differs. The two that we’ll talk about are the linear region and saturation region. The linear region is exactly what it sounds like. Voltage and current are linearly related, following Ohm’s law. The saturation region is where this falls apart, and more voltage is needed to increase current by the same amount, and we see diminishing returns until maximum current is reached.
"IvsV mosfet" by User:CyrilB. Licensed under CC BY-SA 3.0 via Wikimedia Commons
So this is where we see a fundamental difference in the implementation of a CMOS PA and GaAs PA. Gallium arsenide has higher electron mobility, even in saturation mode. This makes it easier for GaAs circuits to switch at incredibly high frequencies such as 60 GHz for WiGig/802.11ad. In addition, unlike silicon-based transistors, gallium arsenide transistors are generally heat-insensitive, and pure GaAs has high resistance and dielectric constant, which means that it serves as an excellent substrate for various components for the same reasons that silicon-on-insulator (SOI) is a good substrate for CMOS logic. In addition, GaAs-based transistors are highly linear in behavior when compared to CMOS technologies, so a GaAs PA can operate much closer to maximum current without clipping the signal.
However, GaAs is not perfect. For one, CMOS logic is impossible to implement in GaAs using current technologies. This is because unlike current CMOS technology, where there are NMOS and PMOS transistors, GaAs-based transistors do not have a p-channel equivalent. As a result, the controls available over a GaAs PA are significantly cruder than what is possible with a CMOS PA. GaAs PAs are also noticeably less efficient at low power levels, also known as the backoff condition. Therefore, CMOS PAs tend to be more power efficient at lower power states as they can have multiple “maximum power states” to scale the PA as needed. In practice, Qualcomm claims that we’re still looking at around a 5-10% efficiency delta at max power when compared to GaAs PAs, which means that CMOS PAs are close to GaAs PAs in efficiency.
Despite the differences in efficiency, a CMOS PA is still a valid option in smartphones due to the fact that a single PA can cover far more bands than an equivalent GaAs PA, as the PAE curve across a spread of frequencies is relatively flat compared to a GaAs PA which is effectively a single peak. In addition, the fact that the PA is built on CMOS means that there is additional integration capability. In its current incarnation, the QFE33xx already has an integrated antenna switch, and there is potential for greater integration with parts that share a similar process node.
This concludes RF360, which gives a broad survey of what’s on the market today. Combined with our previous piece on MDM9x25, it’s possible to get a good idea of the current state of the market. However, the latest and greatest modems and transceivers mean that it’s time to talk about UE category 6, 9, and 10 LTE and the various challenges that come with new capabilities.
WTR3925
There are a few things that are important when talking about a transceiver. To recap, transceivers have a few key elements. On the receive side, we see the need for low noise amplifiers, down-converters, and narrow-band amplifiers. On the transmit side, we need a driver amplifier, up-converter, and another set of narrow-band amplifiers. While most of RF360 is built on relatively old process nodes for CMOS technology, the transceiver can be built on newer CMOS processes because it doesn’t have to handle the level of signal that the rest of the front end does.
At a high level, the WTR3925 really brings two new capabilities to the table. First, it does away with the need for a companion transceiver in order to achieve carrier aggregation, which the WTR1625L/WFR1620 combination provided. It seems that this is due to the need for additional ports on the transceiver, which the WTR1625L lacked. The other improvement is that WTR3925 moves to a new 28nm RF process, as opposed to the 65nm RF process used for the WTR1625L.
As a quick aside, RF processes are largely similar to CMOS processes, although with a few modifications. These changes can include thicker metal in interconnects between transistors and memcaps, which are analogous to capacitors in DRAM. Qualcomm claims that the new process will drive down power consumption; however, this is a product of a new architecture that takes advantage of the smaller process node. Unlike digital logic such as what we see on the baseband, RF does not directly benefit from scaling to lower processes. In fact, there is a chance that scaling to lower process nodes can hurt power consumption because even though the transistor can operate faster, there is more noise. As a result of this noise, the amplifiers in the transceivers may need more stages and more power in order to achieve the same noise figure.
MDM9x35
While baseband was previously one of the most popular topics in RF, as can be seen by this article, RF is much more than just the baseband. However, the baseband is a critical part of the chain. The RF front end is critical for reception and a myriad of other issues, but feature support and control of the front end lies with the baseband. The baseband must properly interpret the information that the front end provides and also send out information to the front end to transmit.
Fortunately, the baseband is implemented with digital logic, so there are significant benefits to moving to the latest and greatest CMOS process node. Lower voltage (and therefore power) is needed to drive the transistors, and it becomes easier to drive higher performance in the DSP. In the case of the MDM9x35, we see that there's a QDSP clocked at 800 MHz for modem functions, and a 1.2 GHz Cortex A7 for functions such as mobile hotspot.
In the case of MDM9x35, there are two major contributors to the reduction in power consumption. The first is the move from 28nm HPm to 20nm SoC. While 20nm SoC doesn’t utilize FinFET, we still see scaling in power, performance, and density. The other area where we see power savings is better implementation of various algorithms. As a result, we should see around 20-25% power savings with the same workload.
MDM9x45
In the time since the first MDM9x35 devices were launched, Qualcomm has also iterated on modems. With the 9x45 generation, we see a move to category 10 LTE, which includes 450 Mbps maximum download speed when aggregating three 20 MHz carriers, and two 20 MHz carriers on the uplink for a maximum of 100 Mbps. Although the Snapdragon 810 doesn't have a 9x45 IP block for the modem, the Snapdragon 810 does support a maximum of 450 Mbps for download with category 9 LTE. However, there is no uplink carrier aggregation in such a scenario. Uplink carrier aggregation is only possible with category 7, which limits downlink speeds to 300 Mbps.
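The category figures here follow from simple per-carrier arithmetic, sketched below. The per-carrier rates are derived from the category 10 numbers quoted above (450 Mbps over three downlink carriers, 100 Mbps over two uplink carriers). The carrier counts for category 7 and category 9 are our inference from the quoted peak rates and the presence or absence of uplink carrier aggregation, not something Qualcomm spelled out.

```python
# Per 20 MHz carrier rates, derived from the category 10 figures
DOWNLINK_MBPS_PER_CARRIER = 450 / 3
UPLINK_MBPS_PER_CARRIER = 100 / 2

# (downlink carriers, uplink carriers); category 7 and 9 counts are inferred
CARRIERS_BY_CATEGORY = {
    "Category 7": (2, 2),
    "Category 9": (3, 1),
    "Category 10": (3, 2),
}


def peak_rates_mbps(downlink_carriers: int, uplink_carriers: int) -> tuple:
    """Peak (downlink, uplink) rates in Mbps for a given carrier combination."""
    return (
        downlink_carriers * DOWNLINK_MBPS_PER_CARRIER,
        uplink_carriers * UPLINK_MBPS_PER_CARRIER,
    )


for category, (dl, ul) in CARRIERS_BY_CATEGORY.items():
    down, up = peak_rates_mbps(dl, ul)
    print(f"{category}: {down:.0f} Mbps down, {up:.0f} Mbps up")
```

Seen this way, the Snapdragon 810's integrated modem has to trade a third downlink carrier against a second uplink carrier, while the MDM9x45 can run both at once.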
Qualcomm claims that the MDM9x45 should bring around 40% energy savings in an LTE carrier aggregation scenario when compared to the MDM9x25 modem. In addition, these new modems bring in a new generation of GNSS location, with support for the EU's Galileo constellation. It's likely that the DSPs and other aspects of this modem have been beefed up relative to the 9x35 and 8994 modems to enable category 10 data rates.
Energy Aware Scheduler
Although big.LITTLE has been around for a few years at this point, it’s still worth going over the basics of big.LITTLE in mobile SoCs. Fundamentally, the smartphone SoC space has benefited greatly from playing catch-up to the PC space. At first, SoCs were on lagging process nodes and CPUs were simple and almost entirely in-order in nature. For the first few years, doubling CPU performance every year was possible by adding additional cores, increasing clock speeds, widening the pipeline, and jumping down a process node.
Once we reached the limit for optimizing in-order architectures, the only way to improve performance in a meaningful way was to focus on avoiding stalls in the CPU pipeline. In an in-order CPU architecture, any missing information for executing an instruction means that the CPU must wait for the information to arrive from DRAM or some other storage. Even if a CPU executes incredibly quickly otherwise, it is stuck waiting on dependencies that can significantly degrade performance.
The solution is to execute operations out-of-order. After all, if you have to build a PC, you don’t sit around waiting a few weeks for a graphics card to arrive before building the rest of the PC. Similarly, modern CPUs execute instructions out-of-order to improve performance and avoid stalls. However, implementing this logic in a CPU is far from a trivial matter, as a CPU has to be designed to know which instructions can be executed out-of-order and which must be executed in order. Even instructions with dependencies that have yet to be resolved can be executed speculatively, which can save a great deal of time if the results of this speculation are used. As a result of this speculative execution and the logic needed to implement out-of-order execution (OoOE), the number of transistors and interconnects grows dramatically, which means power consumption grows dramatically as well.
It is in the context of this fundamental problem that big.LITTLE came to be. While there are multiple solutions to solving the power problem that comes with OoOE, ARM currently sees big.LITTLE as the best solution. Fundamentally, big.LITTLE seeks to use in-order, low power processors for the vast majority of computing in mobile, but switches tasks to big, out-of-order processors when a task is too much for the little cores to handle. In theory, this seems to be the ideal solution as it makes it possible to retain the power-saving advantages of in-order cores and the performance advantages of big OoOE cores.
Meanwhile, it should be noted that there are other ways of using two heterogeneous CPU clusters, such as cluster migration, which was employed in the first Samsung Exynos big.LITTLE SoCs and is still employed in NVIDIA's Tegra X1. But for now we will only focus on big.LITTLE HMP operation, which allows all cores to be active and exposed to the operating system. To translate this simple idea into reality is a difficult task. Currently, the de facto solution in the mobile space for big.LITTLE is the ARM- and Linaro-developed Global Task Scheduling, which relies on a per-entity load tracking (PELT) mechanism with two load thresholds that decide if a process should be migrated to a corresponding cluster.
There’s a significant amount of terminology flying around regarding how this works, and we've covered the mechanic in our Huawei Honor 6 review and more in depth in our recent ARM A57/A53 investigation of the Galaxy Note 4 with the Exynos 5433. To recap, at its core, the per-entity load tracking is the main mechanism at hand needed to make thread placement work in GTS.
This system is designed to track load per task by weighting recent load the greatest, and slowly reducing the impact of previous load by a decay factor, which is a geometric series by default. Unfortunately, this load metric does have some disadvantages. Primarily, if a task idles for a long period of time and suddenly demands a significant load, as in a race-to-sleep scenario, per-entity load tracking can take a significant amount of time to reach the up-migration threshold needed to move a thread from the little cores to the big cores. Similarly, it can take a significant amount of time for a system using per-entity load tracking to recognize that a task has gone idle and migrate it down to the little cores. This system is also completely unaware of the real-world energy characteristics of the CPU cores, as load is the only real consideration that comes into the scheduler.
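To get a feel for the ramp-up delay, the decaying load average can be sketched in a few lines. We assume the mainline Linux defaults here, where load is accumulated in periods of roughly 1 ms and the decay factor is chosen so that load from 32 periods ago counts half as much; shipping GTS kernels may tune this. The 80% up-migration threshold is purely hypothetical.

```python
# Per-period decay factor: a contribution from 32 periods ago is weighted by 0.5
DECAY = 0.5 ** (1 / 32)


def pelt_update(load: float, running: bool) -> float:
    """Advance the normalized load (0.0 to 1.0) by one tracking period."""
    contribution = 1.0 if running else 0.0
    return load * DECAY + (1.0 - DECAY) * contribution


def periods_to_cross(threshold: float, start_load: float = 0.0) -> int:
    """Periods of continuous running needed before load reaches the threshold."""
    load = start_load
    periods = 0
    while load < threshold:
        load = pelt_update(load, running=True)
        periods += 1
    return periods


UP_THRESHOLD = 0.8  # hypothetical up-migration threshold
print(periods_to_cross(UP_THRESHOLD), "periods from fully idle to the up threshold")
```

With these assumptions a task that wakes from a long idle needs 75 periods, or about 75 ms of continuous running, before its tracked load crosses the threshold, which is a long time in a race-to-sleep workload.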
For the Snapdragon 810, Qualcomm has fundamentally done away with per-entity load tracking, and uses a window-based system instead. While we weren’t told the size of each window or how many non-idle windows were retained, the load tracking system uses the average of load across all of the recent windows while also looking at the recent maximum value to determine if there is a task that suddenly requires a significant amount of CPU power. This means that there’s a much shorter waiting period for core scheduling when a thread goes from mostly or completely idle to a high load or vice versa.
This scenario is common throughout smartphones and can be as simple as reading a web page and then opening a new link. The average metric over all of the windows is used to determine whether a thread needs to continue to run on a big core, or whether it can be safely moved down to a little core. In addition, this window-based system accounts for cases where cores are throttled from their maximum frequencies, which means that processes may stay on little cores even if the load for a task is high for a little core if it would perform worse on a throttled big core.
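A rough sketch of the window-based idea is shown below. Since we weren't told the window size, the number of retained windows, or the thresholds, every constant here is a placeholder, and the class and function names are our own invention rather than anything from Qualcomm's kernel. The sketch also ignores throttling and simply keeps every window, idle or not.

```python
from collections import deque


class WindowLoadTracker:
    """Tracks a task's busy fraction over a fixed number of recent windows."""

    def __init__(self, num_windows: int = 5):  # window count is a placeholder
        self.windows = deque(maxlen=num_windows)

    def record_window(self, busy_fraction: float) -> None:
        self.windows.append(busy_fraction)

    def average(self) -> float:
        return sum(self.windows) / len(self.windows) if self.windows else 0.0

    def recent_max(self) -> float:
        return max(self.windows, default=0.0)


def choose_cluster(tracker: WindowLoadTracker, on_big: bool,
                   up_threshold: float = 0.8, down_threshold: float = 0.3) -> str:
    """Use the recent maximum to move up quickly and the average to move down."""
    if not on_big and tracker.recent_max() >= up_threshold:
        return "big"
    if on_big and tracker.average() <= down_threshold:
        return "little"
    return "big" if on_big else "little"


# A task that idles for four windows, then saturates the CPU for one window
tracker = WindowLoadTracker()
for busy in (0.0, 0.0, 0.0, 0.0, 1.0):
    tracker.record_window(busy)
print(choose_cluster(tracker, on_big=False))
```

In this toy model a single busy window is enough to trigger a move to a big core, whereas a decaying average would still be dominated by the idle history.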
While there are some areas where we can compare and contrast current GTS solutions and Qualcomm’s solution for the Snapdragon 810, there are areas where no comparisons can be made at all. Although ARM is working on an energy model and an energy aware scheduler, we haven’t seen this working in any shipping SoC.
For the Snapdragon 810, there is an energy model for all CPUs that controls for changing power consumption with temperature and can provide a metric of performance per watt at all frequency states. However, unlike ARM’s energy cost model there’s no tracking for the power cost of a task that increases frequency on the cluster (synchronous architectures require that all CPUs run at the same frequency), nor are wake ups tracked and accounted for in energy modeling.
To be fair, there are a lot of aspects that are shared with the latest GTS mechanism such as packing small tasks onto already awake CPUs in order to avoid the cost of waking a CPU from a power collapse state. However, on the Snapdragon 810 there are evaluations throughout the execution of a task to be able to move a task to a big core if its load increases from when the task first started, or if it’s necessary to move a thread from one big core to another big core depending upon the perf/W for each big core. In addition, if a single core is running a small load or task the scheduler can move the thread to another core and allow the other core to go to sleep and save power. The scheduler is also said to be aware of the power cost of migration.
Finally, the scheduler in Snapdragon 810 is used to help guide the CPU frequency governor policy by notifying the frequency governor appropriately to avoid cases where a task is migrated to another CPU and causes inefficient behavior. For example, if a task is at 100% load and is migrated in the middle of a sampling window to another core, the original core isn’t kept at an unnecessarily high frequency, and the core that the task was migrated to will go to the right frequency for the aggregate load of the task. This appears to be somewhat of a mitigation for the window-based system, as ARM’s scheduler uses events to handle these issues without having to resort to these patchwork fixes.
In terms of how the power arbitration is actually implemented compared to traditional power management mechanisms in existing SoCs, Qualcomm replaces the old Intel-developed CPUIdle "Menu" and "Ladder" governors. These worked based on the achieved and target residency time of the individual idle states of a CPU core. Qualcomm's solution is a completely new approach (called the Low-Power-Mode CPUIdle driver, or LPM) as it ignores the time characteristic in its entirety and looks only at energy modeling. For this, the SoC's drivers need to have precise arbitration data to be able to properly model the SoC's real power consumption without actually measuring it. Thankfully Qualcomm does this, and it's the most complete model of a commercially available SoC's power characteristics to date.
We not only see the energy models for the various CPU and cluster idle states, but also the idle states of the CCI, something which is lacking in GTS's software stack.
Ultimately, while it’s clear that Qualcomm’s solution to the big.LITTLE problem has its inefficiencies, their solution appears to be far superior to anything else with big.LITTLE on the market. And as previously mentioned in our Note 4 Exynos review, ARM’s energy aware scheduler is still far from implementation on a shipping SoC. This issue is only compounded by ARM’s need to make a solution that works for all big.LITTLE SoCs, and OEM adoption is often slow in these scenarios. While the Snapdragon 810 could be behind other SoCs in process technology, advantages in areas such as the thread scheduler could narrow the gap.
CPU/System Performance
While talking about the energy aware scheduler and various other aspects of the Snapdragon 810 is helpful to understand how the SoC works, ultimately we must look at performance to determine whether Qualcomm's work to differentiate their SoC was worthwhile or not. To do this, we ran Qualcomm's tablet Mobile Development Platform (MDP) through our standard suite of benchmarks, although I was unable to run benchmarks such as BaseMark X and PCMark due to odd issues with the tablet.
Starting off, we have a complete breakdown of GeekBench 3 scores. The scores are closer to theoretical performance than real-world performance, but they're very useful for highlighting architectural changes. For our look at GeekBench, we are comparing the 810 reference platform to our results from our recent Galaxy Note 4 Exynos review, along with results for the Snapdragon 805-based Galaxy Note 4 taken from the GeekBench database.
GeekBench 3 - Integer Performance
Benchmark | Snapdragon 805 (ARMv7) | Exynos 5433 (AArch32) | Snapdragon 810 (AArch64) | S810 > S805 % Advantage
AES ST | 85.4 MB/s | 1330 MB/s | 604.9 MB/s | 608%
AES MT | 350.4 MB/s | 4260 MB/s | 3050 MB/s | 770%
Twofish ST | 94.0 MB/s | 81.9 MB/s | 85.7 MB/s | -8.8%
Twofish MT | 329.8 MB/s | 440.5 MB/s | 448.5 MB/s | 36%
SHA1 ST | 202.1 MB/s | 464.2 MB/s | 428.1 MB/s | 112%
SHA1 MT | 806.1 MB/s | 2020 MB/s | 3019 MB/s | 275%
SHA2 ST | 95.1 MB/s | 121.9 MB/s | 81 MB/s | -15%
SHA2 MT | 367.3 MB/s | 528.3 MB/s | 393.4 MB/s | 7.1%
BZip2Comp ST | 4.46 MB/s | 4.88 MB/s | 4.99 MB/s | 12%
BZip2Comp MT | 15.5 MB/s | 19.3 MB/s | 20.5 MB/s | 32%
BZip2Decomp ST | 6.43 MB/s | 7.41 MB/s | 7.99 MB/s | 24%
BZip2Decomp MT | 21.7 MB/s | 29.7 MB/s | 30.8 MB/s | 42%
JPG Comp ST | 20.4 MP/s | 19.3 MP/s | 18.9 MP/s | -7.4%
JPG Comp MT | 79.9 MP/s | 88.8 MP/s | 88.9 MP/s | 11%
JPG Decomp ST | 30.6 MP/s | 43.5 MP/s | 36.3 MP/s | 19%
JPG Decomp MT | 115.7 MP/s | 149.6 MP/s | 182.7 MP/s | 58%
PNG Comp ST | 0.82 MP/s | 1.11 MP/s | 1.11 MP/s | 35%
PNG Comp MT | 3.01 MP/s | 4.57 MP/s | 4.78 MP/s | 59%
PNG Decomp ST | 18.7 MP/s | 19.1 MP/s | 15.6 MP/s | -17%
PNG Decomp MT | 63.7 MP/s | 78.8 MP/s | 94.1 MP/s | 48%
Sobel ST | 39.2 MP/s | 58.6 MP/s | 53.3 MP/s | 36%
Sobel MT | 128 MP/s | 221.3 MP/s | 248.4 MP/s | 94%
Lua ST | 0.92 MB/s | 1.24 MB/s | 1.30 MB/s | 41%
Lua MT | 1.36 MB/s | 2.48 MB/s | 5.93 MB/s | 336%
Dijkstra ST | 4.46 Mpairs/s | 5.23 Mpairs/s | 3.38 Mpairs/s | -24%
Dijkstra MT | 13.2 Mpairs/s | 17.1 Mpairs/s | 13.7 Mpairs/s | 3.8%
Thanks in large part to the new cryptographic capabilities of the ARMv8 cores, Snapdragon 810 gets off to a very good start in GeekBench 3's integer benchmarks. Moving past the cryptography tests, the 810 continues to hold a considerable advantage in most of the remaining sub-tests; BZip2 decompression, Lua script performance, and JPEG decompression all show considerable gains over the Snapdragon 805 based Galaxy Note 4. Snapdragon 810's overall performance improvement here is a rather large 45%, though if we throw out the especially large gains that come from Lua MT, the overall performance advantage is closer to 30%.
There are a few cases where performance regresses, however, including PNG decompression and Dijkstra's algorithm in their single-threaded runs. This could be a result of memory performance (more on that later) or architectural differences. It's worth pointing out that these cases are also among the only cases where Snapdragon 810 notably trails the Exynos 5433.
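For readers who want to check the rightmost column of the integer table, the short sketch below shows how a percentage advantage can be derived from two scores. The function name `pct_advantage` and the selection of rows are ours, chosen purely for illustration; the scores themselves are copied from the table above.

```python
def pct_advantage(new_score, old_score):
    """Percentage by which new_score exceeds old_score (negative means a regression)."""
    return (new_score / old_score - 1.0) * 100.0

# (Snapdragon 805, Snapdragon 810) scores copied from the integer table above.
rows = {
    "AES ST": (85.4, 604.9),
    "Lua MT": (1.36, 5.93),
    "PNG Decomp ST": (18.7, 15.6),
    "Dijkstra ST": (4.46, 3.38),
}

for name, (s805, s810) in rows.items():
    print(f"{name}: {pct_advantage(s810, s805):+.1f}%")
```

Because each row is a simple ratio, a handful of very large gains (AES, Lua MT) can dominate any average, which is why it is worth looking at the result with and without those outliers.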
GeekBench 3 - Floating Point Performance
Benchmark | Snapdragon 805 (ARMv7) | Exynos 5433 (AArch32) | Snapdragon 810 (AArch64) | S810 > S805 % Advantage
BlackScholes ST | 4.33 Mnodes/s | 4.37 Mnodes/s | 5.01 Mnodes/s | 16%
BlackScholes MT | 17.0 Mnodes/s | 20.4 Mnodes/s | 25.5 Mnodes/s | 50%
Mandelbrot ST | 0.87 GFLOPS | 1.14 GFLOPS | 1.20 GFLOPS | 38%
Mandelbrot MT | 3.45 GFLOPS | 5.09 GFLOPS | 6.41 GFLOPS | 86%
Sharpen Filter ST | 886 MFLOPS | 1030 MFLOPS | 1007 MFLOPS | 14%
Sharpen Filter MT | 3.54 GFLOPS | 4.31 GFLOPS | 5.02 GFLOPS | 42%
Blur ST | 1.18 GFLOPS | 1.27 GFLOPS | 1.26 GFLOPS | 6.8%
Blur MT | 4.67 GFLOPS | 5.03 GFLOPS | 6.14 GFLOPS | 31%
SGEMM ST | 2.82 GFLOPS | 1.81 GFLOPS | 2.29 GFLOPS | -19%
SGEMM MT | 8.05 GFLOPS | 6.1 GFLOPS | 6.12 GFLOPS | -24%
DGEMM ST | 0.81 GFLOPS | 0.57 GFLOPS | 1.03 GFLOPS | 27%
DGEMM MT | 2.69 GFLOPS | 2.29 GFLOPS | 2.81 GFLOPS | 4.5%
SFFT ST | 1.16 GFLOPS | 1.1 GFLOPS | 1.25 GFLOPS | 7.8%
SFFT MT | 4.55 GFLOPS | 4.56 GFLOPS | 4.11 GFLOPS | -9.7%
DFFT ST | 0.47 GFLOPS | 1.02 GFLOPS | 1.03 GFLOPS | 119%
DFFT MT | 1.89 GFLOPS | 3.46 GFLOPS | 2.97 GFLOPS | 57%
N-Body ST | 331.5 Kpairs/s | 370.4 Kpairs/s | 486.6 Kpairs/s | 47%
N-Body MT | 1.12 Mpairs/s | 1.44 Mpairs/s | 1.72 Mpairs/s | 54%
Ray Trace ST | 1.48 MP/s | 1.7 MP/s | 1.73 MP/s | 17%
Ray Trace MT | 5.77 MP/s | 6.65 MP/s | 8.16 MP/s | 41%
GeekBench's floating point performance shows a similar range of performance increases. More often than not multi-threaded performance gains exceed single-threaded performance gains, which is a hopeful sign for how well the Snapdragon 810 reference platform can hold up when all four big cores are being hammered. Otherwise the 810 shows considerable performance gains on almost every benchmark here, the notable exception being SGEMM performance (multi-threaded SFFT also dips slightly).
In SGEMM, Snapdragon 810 performance is relatively close to Exynos 5433 performance even though it has the advantage of running in AArch64 mode, which should give the FP numbers a boost over the Exynos. The SGEMM test is likely an isolated case where the Krait architecture and Snapdragon 805's high clock speed work in its favor. The overall Snapdragon 810 performance improvement is 30%, almost exactly what we saw with GeekBench integer performance as well (after throwing out Lua MT).
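The observation that multi-threaded gains usually exceed single-threaded gains can be checked directly against the floating point table. The sketch below is our own illustration rather than part of the benchmark suite; the dictionary layout and the helper name `mt_scaling_winners` are assumptions, while the scores are copied from the table above.

```python
# (S805 ST, S810 ST, S805 MT, S810 MT), copied from the floating point table.
# Some ST and MT rows use different units, but each ratio compares like with like.
fp_scores = {
    "BlackScholes": (4.33, 5.01, 17.0, 25.5),
    "Mandelbrot": (0.87, 1.20, 3.45, 6.41),
    "Sharpen Filter": (886, 1007, 3.54, 5.02),
    "Blur": (1.18, 1.26, 4.67, 6.14),
    "SGEMM": (2.82, 2.29, 8.05, 6.12),
    "DGEMM": (0.81, 1.03, 2.69, 2.81),
    "SFFT": (1.16, 1.25, 4.55, 4.11),
    "DFFT": (0.47, 1.03, 1.89, 2.97),
    "N-Body": (331.5, 486.6, 1.12, 1.72),
    "Ray Trace": (1.48, 1.73, 5.77, 8.16),
}

def mt_scaling_winners(scores):
    """Return workloads whose S810/S805 ratio is higher in MT than in ST."""
    winners = []
    for name, (st_old, st_new, mt_old, mt_new) in scores.items():
        if mt_new / mt_old > st_new / st_old:
            winners.append(name)
    return winners

mt_wins = mt_scaling_winners(fp_scores)
print(f"{len(mt_wins)} of {len(fp_scores)} workloads gain more in MT than in ST")
```

With the table's figures this comes out to six of the ten workloads, with the GEMM and FFT tests being the ones where the single-threaded gain is larger.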
GeekBench 3 - Memory Performance
Benchmark | Snapdragon 805 (ARMv7) | Exynos 5433 (AArch32) | Snapdragon 810 (AArch64) | S810 > S805 % Advantage
Stream Copy ST | 8.37 GB/s | 5.56 GB/s | 6.02 GB/s | -28%
Stream Copy MT | 10.2 GB/s | 5.80 GB/s | 7.57 GB/s | -26%
Stream Scale ST | 5.17 GB/s | 4.98 GB/s | 6.61 GB/s | 28%
Stream Scale MT | 8.05 GB/s | 5.77 GB/s | 7.37 GB/s | -8.4%
Stream Add ST | 5.06 GB/s | 4.85 GB/s | 5.64 GB/s | 11%
Stream Add MT | 7.46 GB/s | 5.72 GB/s | 6.62 GB/s | -11%
Stream Triad ST | 5.37 GB/s | 4.82 GB/s | 5.6 GB/s | 4.3%
Stream Triad MT | 8.20 GB/s | 5.73 GB/s | 6.63 GB/s | -19%
Usually we don't like to post the GeekBench memory scores, but in this case there is an interesting phenomenon going on with the Snapdragon 810. The LPDDR4 memory running at 1555MHz gives the SoC a large advantage in memory bandwidth over the Exynos 5433, and the Snapdragon 810 runs its CCI at 787 MHz, giving the CPU port a theoretical 12.6 GB/s, much more than the 5433's 6.6 GB/s. Even so, the measured bandwidth difference is much smaller, and none of the sub-tests comes anywhere near that theoretical figure.
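As a rough sanity check on these theoretical figures, the arithmetic can be sketched as follows. This is a back-of-the-envelope model, not a description of Qualcomm's implementation: the function names are ours, and the 128-bit CCI port width with one transfer per clock is an assumption chosen because it reproduces the 12.6 GB/s quoted above.

```python
def dram_bandwidth_gbs(channels, bus_bits, clock_mhz, transfers_per_clock=2):
    """Peak DRAM bandwidth in GB/s; LPDDR moves data on both clock edges."""
    bytes_per_transfer = channels * bus_bits / 8
    return bytes_per_transfer * clock_mhz * 1e6 * transfers_per_clock / 1e9

def port_bandwidth_gbs(width_bits, clock_mhz):
    """Peak bandwidth of an interconnect port moving one beat per clock."""
    return width_bits / 8 * clock_mhz * 1e6 / 1e9

# Snapdragon 810: 2x 32-bit LPDDR4 at 1555MHz, per the specification table.
s810_dram = dram_bandwidth_gbs(channels=2, bus_bits=32, clock_mhz=1555)

# CCI CPU port at 787 MHz; the 128-bit width is an assumption.
s810_cci = port_bandwidth_gbs(width_bits=128, clock_mhz=787)

# Highest Snapdragon 810 result in the memory table (Stream Copy MT), in GB/s.
best_measured = 7.57

print(f"DRAM peak: {s810_dram:.2f} GB/s")
print(f"CCI port peak: {s810_cci:.2f} GB/s")
print(f"Best measured / CCI peak: {best_measured / s810_cci:.0%}")
```

Under these assumptions the DRAM figure lands close to the 24.8 GB/s in the specification table, and even the best Stream result reaches only about 60% of the CPU port's theoretical peak.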
To look at this in more detail, we use AndEBench's memory benchmarks, and indeed we see a similar result.
It's the memory latency test in particular that's very worrisome, as the MDP tablet achieves a very bad throughput score. We're not sure why this happens, but we hope to investigate this further in the future when we get the chance to review a shipping Snapdragon 810 device.
Continuing on, let's look at our browser bench suite.
Here, the Snapdragon 810 is off to quite a start. While the relationship isn't exact, performance in these benchmarks generally tracks CPU performance. The Snapdragon 810 shines here and approaches the Nexus 9, which has a strong showing due to the underlying Denver CPU's code optimizer unrolling loops in the benchmark.
In Octane, we see that the Snapdragon 810 continues to be competitive with some of the fastest SoCs available today.
On the other hand, in WebXPRT we see that performance ends up somewhere around the level of the Snapdragon 801. It's possible that we're looking at thermal throttling or some other issue here as I was unable to run multiple trials of this test. Our browser based tests are otherwise generally consistent with what we found earlier this week on the A57-based Exynos 5433, so it's unlikely we're looking at an unoptimized Chrome build.
Continuing with the benchmarks we were able to perform at the performance preview, BaseMark OS II is next.
The System numbers of the Snapdragon 810 MDP/T seem disappointing, as it falls at the lower end of our current flagship device lineup. It seems the reference platform isn't as well optimized as it should be.
Similar to the Nexus 9, we see some odd trends in the memory tests. It isn't quite clear what's causing this, but a performant eMMC is certainly a possibility. Due to this test being very device specific, we can't really judge the Snapdragon 810's performance here.
As for BaseMark, what we get are very binary results with the 810 either coming in near the top or bottom. The overall score still looks quite good as a result, boosted by things such as chart-topping graphics performance. On the other hand the web and system scores are struggling, coming in 20% or more behind the Note 4 and its A57-based Exynos 5433. Since this is an early device this may be a case of early teething issues with performance, possibly with optimizations or the OS, but at this point in time it's difficult to confirm anything.
Meanwhile looking at how Qualcomm's reference platform compares to the Snapdragon 805-based Nexus 6, we find some significant performance gains at times. Though Krait has held up admirably against its A15 based competition, A57 finally provides a solid jump in performance over what even the fastest Krait can offer.
GPU Performance
Last but certainly not least, we have GPU performance. As we mentioned earlier, the Snapdragon 810 introduces Qualcomm's Adreno 430, the latest member of the Adreno 400 GPU family. Qualcomm's own performance estimates call for a 30% increase over Adreno 420, with a final GPU clock of 600MHz being identical to the Snapdragon 805's (Adreno 420) own GPU clock speed.
From an architectural standpoint Adreno continues to be something of a black box for us. Other than being a modern OpenGL ES 3.1/AEP design, we don't know too much about how the GPU is laid out, and Qualcomm's current legal battle with NVIDIA is likely not helping matters. In any case, Qualcomm has indicated that Adreno 430 is not just a simple extension of Adreno 420, so we may be looking at an architectural change such as wider shader blocks.
For today's benchmarks, as we mentioned before we only had a limited amount of time with the Snapdragon 810 and had issues with BaseMark X. We've had to pare down our GPU benchmarks to just 3DMark 1.2 and GFXBench 3.0. Once we get final hardware in, we will be able to run a wider array of graphics benchmarks on Snapdragon 810.
Starting off with 3DMark, compared to the Snapdragon 805 reference platform the actual graphics performance advantage is even greater than 30%, coming in at closer to 65%. However since drivers play a big role in this, a more recent 805 platform like the Nexus 6 may be a better comparison point, in which case the gains are 33%, just a hair over Qualcomm's own baseline performance estimate. We also find that Snapdragon 810 oddly struggles at physics performance here, underperforming Snapdragon 805 devices, something the Exynos 5433 didn't have trouble with. As a result overall performance is only slightly improved over the Nexus 6.
Continuing with GFXBench, we look at more pure GPU loads. One has to take note that the MDP/T employs a 4K screen resolution, and the on-screen results will likely suffer from that.
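To put the resolution caveat in perspective, the sketch below compares raw pixel counts at common panel resolutions. It is a simplification: it assumes the MDP/T's "4K" panel is 3840x2160 (the article only says 4K) and that per-frame rendering cost scales roughly with the number of pixels, which is only approximately true for shader-bound workloads. The names `RESOLUTIONS`, `pixel_count`, and `relative_load` are ours.

```python
RESOLUTIONS = {
    "1080p": (1920, 1080),
    "QHD": (2560, 1440),
    "4K UHD": (3840, 2160),  # assumed panel resolution; the article only says "4K"
}

def pixel_count(name):
    """Total number of pixels for a named resolution."""
    width, height = RESOLUTIONS[name]
    return width * height

def relative_load(name, baseline="1080p"):
    """Pixel count of a resolution relative to the baseline resolution."""
    return pixel_count(name) / pixel_count(baseline)

for name in RESOLUTIONS:
    print(f"{name}: {pixel_count(name):,} pixels, {relative_load(name):.2f}x 1080p")
```

A 4K panel has four times the pixels of a 1080p one, so an on-screen result at native resolution is not directly comparable with devices that render at 1080p, which is why the off-screen results are the fairer comparison.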
Under GFXBench 3.0's full rendering tests of Manhattan and T-Rex, the Snapdragon 810 continues to show considerable performance gains over the Snapdragon 805. Ignoring the onscreen results for now since the Snapdragon 810 reference platform runs at such a high resolution, offscreen results show the 810 outperforming the 805 by 33% in Manhattan and 16% in T-Rex. The former is again well in line with Qualcomm's performance estimate, while the older T-Rex benchmark doesn't show the same gains, possibly indicating that Adreno 430's biggest gains are going to come from shader-bound scenarios.
Meanwhile GFXBench's synthetic tests continue to put Adreno 430 and the Snapdragon 810 in a good light. ALU performance in particular is showing very large gains - 46% better than the Snapdragon 805 and Adreno 420 - while our blending and fillrate tests show almost no gain over Snapdragon 805. This adds further credence to our theory that Qualcomm has widened or otherwise improved Adreno's shader blocks for 430, as other elements of the GPU are not showing significant performance changes.
Finally, GFXBench's driver overhead and accuracy tests are more or less what we would expect for Snapdragon 810. In the case of driver overhead, a combination of newer drivers and a much faster CPU have reduced the CPU cost of driver overhead. Meanwhile with the underlying GPU architecture being unchanged, there are no material changes to quality/accuracy.
Overall, then, the performance gains for the Adreno 430 and Snapdragon 810 seem to be almost exclusively focused on shader performance, but in those cases where rendering workloads are shader bound, Qualcomm's 30% estimate is on the mark. Real-world performance gains meanwhile are going to depend on the nature of the workload; games and applications that are similarly shader-bound should see good performance gains, while anything that's bottlenecked by pixel throughput, texturing, or front-end performance will see much smaller gains. Thankfully for Qualcomm most high-end workloads are indeed shader bound, and this is especially the case when pushing high resolutions, as Qualcomm is trying to do with their 4K initiative for Snapdragon 810. However in the case of 4K, while Adreno 430 offers improved performance it's still slow enough that it's going to struggle to render any kind of decently complex content at that resolution.
As for Adreno 430 versus the competition, Qualcomm has narrowed much of the gap between themselves and NVIDIA/Apple, but they haven't closed it. Apple's Imagination GXA6850 and NVIDIA's K1 GPUs continue to hold a performance advantage, particularly in GFXBench's Manhattan and T-Rex full rendering tests. Both Apple and NVIDIA invested significant die space in graphics, and while we don't know how much Qualcomm has invested in Adreno 430 with Snapdragon 810, it's safe to say right now that they would need to invest even more if they want to beat the graphics performance of NVIDIA and Apple's tablet SoCs.
Final Words
In light of everything, it seems that the Snapdragon 810 is not what the rumors claimed. In my experience, I didn’t notice any of the development devices getting hotter than what I’d come to expect from a modern SoC. In most cases, it appears that CPU performance is about what we’d expect from a cluster of four Cortex A57s at 2 GHz, although there are a few anomalous results that could be a concern. If anything, it’s clear that the CPU isn’t really an area of weakness on the Snapdragon 810, especially with all of the work that Qualcomm has done on an energy-aware scheduler to maximize the performance and efficiency of their big.LITTLE implementation.
Outside of the CPU, it’s evident that Qualcomm will retain their traditional lead in the modem and RF space, as OEMs will continue to adopt parts of RF360 along with Qualcomm modems and transceivers to ensure maximum performance on flagship smartphones and other high-end mobile devices. I don’t believe any other company will really be able to beat Qualcomm in this space, as they strongly emphasized just how well-validated their modems are and the extent to which they implement standards properly to work with operators around the world without issue.
While my time with the Snapdragon 810 hasn’t revealed any significant issues, the real concern here seems to be more along the lines of the GPU performance. While ALU performance and compute performance in general are significantly improved with the Adreno 430, the performance uplift doesn’t really seem to be as large as one might hope. Although Qualcomm is trying to sell the idea of a 4K tablet with the Snapdragon 810, it feels as if it’s too early to try to drive such high resolutions when the GPU can’t handle them. In order to see an appreciable increase in performance this year, it’s likely that OEMs will need to stay with 1080p or at most QHD display resolutions to really deliver improved graphics performance for gaming and other GPU intensive use cases.
As we’ve mentioned before, Qualcomm seemed to stumble a bit following the launch of Apple’s A7 SoC. While it seemed that Snapdragon 810 might have relatively little competitive advantage over other SoCs, in the past few months it’s become clear that Qualcomm has been leveraging their strengths to ensure that they remain a strong choice for SoCs this year. Although the GPU and memory subsystem appear to be a bit weak, overall 2015 remains promising for Android flagships, even if an OEM can’t design their own SoC.