Original Link: https://www.anandtech.com/show/17102/snapdragon-8-gen-1-performance-preview-sizing-up-cortex-x2
The Snapdragon 8 Gen 1 Performance Preview: Sizing Up Cortex-X2
by Dr. Ian Cutress on December 14, 2021 8:00 AM EST
At the recent Qualcomm Snapdragon Tech Summit, the company announced its new flagship smartphone processor, the Snapdragon 8 Gen 1. Replacing the Snapdragon 888, this new chip is set to be in a number of high performance flagship smartphones in 2022. The new chip is Qualcomm’s first to use Arm v9 CPU cores as well as Samsung’s 4nm process node technology. In advance of devices coming in Q1, we attended a benchmarking session using Qualcomm’s reference design, and had a couple of hours to run tests focused on the new performance core, based on Arm’s Cortex-X2 core IP.
The Snapdragon 8 Gen 1
Rather than continue with the 800 naming scheme, Qualcomm is renaming its smartphone processor portfolio to make it easier to understand / market to consumers. The Snapdragon 8 Gen 1 (hereafter referred to as S8g1 or 8g1) will be the headliner for the portfolio, and we expect Qualcomm to announce other processors in the family as we move into 2022. The S8g1 uses the latest range of Arm core IP, along with updated Adreno, Hexagon, and connectivity IP including an integrated X65 modem capable of both mmWave and Sub 6 GHz for a worldwide solution in a single chip.
While Qualcomm hasn’t given any additional insight into the Adreno / graphics part of the hardware, not even giving us a 3-digit identifier, we have been told that it is a new ground-up design. Qualcomm has also told us that the new GPU family is designed to look very similar to previous Adreno GPUs from a feature/API standpoint, which means that for existing games and other apps, it should allow a smooth transition with better performance. We had time to run a few traditional gaming tests in this piece.
On the DSP side, Qualcomm’s headlines are that the chip can process 3.2 Gigapixels/sec for the cameras with an 18-bit pipeline, suitable for a single 200MP camera, 64MP burst capture, or 8K HDR video. The encode/decode engines allow for 8K30 or 4K120 10-bit H.265 encode, as well as 720p960 infinite recording. There is no AV1 decode engine in this chip, with Qualcomm’s VPs stating that the timing for their IP block did not synchronize with this chip.
AI inference performance has also quadrupled - 2x from architecture updates and 2x from software. We have a couple of AI tests in this piece.
As usual with these benchmarking sessions, we’re very interested in what the CPU part of the chip can do. The new S8g1 from Qualcomm features a 1+3+4 configuration, similar to the Snapdragon 888, but using Arm’s newest v9 architecture cores.
- The single big core is a Cortex-X2, running at 3.0 GHz with 1 MiB of private L2 cache.
- The three middle cores are Cortex-A710, running at 2.5 GHz with 512 KiB of private L2 cache.
- The four efficiency cores are Cortex-A510, running at 1.8 GHz with an unknown amount of L2 cache. These four cores are arranged in pairs, with L2 cache being private to a pair.
- On top of these cores is an additional 6 MiB of shared L3 cache and 4 MiB of system level cache at the memory controller, which is a 64-bit LPDDR5-3200 interface for 51.2 GB/s theoretical peak bandwidth.
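The 51.2 GB/s figure in the last bullet can be reproduced with a quick back-of-the-envelope calculation. The sketch below assumes that the "3200" in LPDDR5-3200 refers to a 3200 MHz data clock with two transfers per clock (6400 MT/s); that reading and the function name are assumptions made for illustration, not something Qualcomm stated.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, clock_mhz: float,
                        transfers_per_clock: int = 2) -> float:
    """Theoretical peak memory bandwidth in GB/s (decimal gigabytes).

    Assumes a double-data-rate interface by default, i.e. two
    transfers per clock cycle.
    """
    transfers_per_second = clock_mhz * 1e6 * transfers_per_clock
    bytes_per_transfer = bus_width_bits / 8
    return transfers_per_second * bytes_per_transfer / 1e9


# 64-bit interface, assumed 3200 MHz clock (6400 MT/s)
s8g1_bandwidth = peak_bandwidth_gb_s(bus_width_bits=64, clock_mhz=3200)
print(f"{s8g1_bandwidth:.1f} GB/s")
```

This is a theoretical ceiling only; sustained bandwidth seen by the CPU cores will be lower.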
Compared to the Snapdragon 888, the X2 is clocked around 5% higher than the X1 and has additional architectural improvements on top of that. Qualcomm is claiming +20% performance or +30% power efficiency for the new X2 core over the X1. That last figure is beyond the +16% power efficiency quoted by Samsung for the move from 5nm to 4nm, so Qualcomm must be implementing additional efficiencies in silicon to get that number. Unfortunately Qualcomm would not go into detail on what those are, nor say how the voltage rails are separated or whether this is the same as on the S888. Arm has stated that the X2 core could draw less power than the X1, and if the X2 is on its own voltage rail, that could provide support for Qualcomm’s claims.
The middle A710 cores are also Arm v9, with an 80 MHz bump over the previous generation likely provided by process node improvements. The smaller A510 efficiency cores are built as two complexes each of two cores, with a shared L2 cache in each complex. This layout is meant to provide better area efficiency, although Qualcomm did not explain how much L2 cache is in each complex – normally they do, but for whatever reason in this generation it wasn’t detailed. We didn’t probe the number in our testing here due to limited time, but no doubt when devices come to market we’ll find out.
On top of the cores is a 6 MiB L3 cache as part of the DSU, and a 4 MiB system cache with the memory controllers. Like last year, the cores do not have direct access to this 4 MiB cache. We’ve seen Qualcomm’s main high-end competitor for next year, MediaTek, showcase that L3+system cache will be 14 MiB, with cores having access to all, so it will be interesting to see how the two compare when we have the MTK chip to test.
Benchmarking Session: How It Works
For our benchmarking session, we were given a ‘Qualcomm Reference Device’ (QRD) – this is what Qualcomm builds to show a representation of how a flagship featuring the processor might look. It looks very similar to modern smartphones, with the goal to mirror something that might come to market in both software and hardware. The software part is important, as the partner devices are likely a couple of months from launch, and so we recognize that not everything is final here. These devices also tend to be thermally similar to a future retail example, and it’s pretty obvious if there was something odd in the thermals as we test.
These benchmark sessions usually involve 20-40 press, each with a device, for 2-4 hours as needed. Qualcomm preloads the device with a number of common benchmarking applications, as well as a data sheet of the results they should expect. Any member of the press that wants to sideload any new applications has to at least ask one of the reps or engineers in the room. In our traditional workflow, we sideload power monitoring tools and SPEC2017, along with our other microarchitecture tests. Qualcomm never has any issue with us using these.
As with previous QRD testing, there are two performance presets on the device – a baseline preset expected to showcase normal operation, and a high performance preset that opportunistically puts threads onto the X2 core even when power and thermals are quite high, giving the best score regardless. The debate in smartphone benchmarking of initial runs vs. sustained performance is a long one that we won’t go into here (most notably because 4 hours is too short to do any extensive sustained testing); however, the performance mode is meant to enable a ‘first run’ score every time.
Testing the Cortex-X2: A New Android Flagship Core
Improving on the Cortex-X1 by switching to the Arm v9 architecture and increasing the core resources, both Arm and Qualcomm are keen to promote that the Cortex-X2 offers better performance and responsiveness than previous CPU cores. The small frequency bump from 2.85 GHz to 3.00 GHz will add some of that performance, however the question is always if the new manufacturing process coupled with the frequency increase allows for better power efficiency when running these workloads. Our standard analysis tool here is SPEC2017.
Running through some of these numbers, there are healthy gains to the core, and almost everything has a performance lift.
On the integer side (from 500.perlbench to 557.xz), there are good gains for gcc (+17%), mcf (+13%), xalancbmk (+13%), and leela (+14%), leading to an overall +8% improvement. Most of these integer tests involve cache movement and throughput, and usually gains in sub-tests like gcc can help a wide range of regular user workloads.
Looking at power and energy for the integer benchmarks, we’re seeing the X2 consume more instantaneous power on almost all the tests, but the efficiency is kicking in. That overall 8% performance gain is taking 5% less total energy, but on average requires 2% more peak power.
If we put this core up against all the other performance cores we test, we see that 8% jump in performance for 5% less energy used, and the X2 stands well above the X1 cores of the previous generation, especially those in non-Snapdragon processors. There is still a fundamental step needed to reach the Apple cores, even the previous-generation A14 performance core, which scores 34% higher for the same energy consumed (albeit on average another 34% peak power).
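The performance, energy, and power figures quoted above are tied together by a simple relationship: energy is average power multiplied by runtime, and runtime scales inversely with performance. A minimal sketch of that arithmetic follows, assuming the quoted percentages are ratios against the baseline core and that the power figures are averages over the run; the function name is hypothetical.

```python
def implied_power_ratio(perf_ratio: float, energy_ratio: float) -> float:
    """Average power relative to a baseline core.

    runtime_ratio = 1 / perf_ratio
    power_ratio   = energy_ratio / runtime_ratio = energy_ratio * perf_ratio
    """
    return energy_ratio * perf_ratio


# X2 vs. X1 on the integer suite: +8% performance for 5% less energy
x2_vs_x1 = implied_power_ratio(perf_ratio=1.08, energy_ratio=0.95)

# A14 vs. X2: +34% performance for the same energy
a14_vs_x2 = implied_power_ratio(perf_ratio=1.34, energy_ratio=1.00)

print(f"X2 vs X1 (integer): {x2_vs_x1 - 1:+.1%} average power")
print(f"A14 vs X2 (integer): {a14_vs_x2 - 1:+.1%} average power")
```

With the rounded inputs from the text, the X2 lands a little under 3% above the X1 in average power, within rounding of the 2% quoted, and the A14 comes out at the same 34% quoted above.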
Just on these numbers, Qualcomm’s +20% performance or +30% efficiency doesn’t bear fruit, but the floating point numbers are significantly different.
Several benchmarks in the SPEC2017 floating point suite are substantially higher on the X2 this generation. +17% on namd, for example, would point to execution performance increases, but +28% in parest, +41% in lbm and +20% in blender showcase a mix of execution performance and memory performance. Overall we’re seeing +19% performance, which is nearer Qualcomm’s 20% mark. Note that this comes with an almost identical amount of energy consumed relative to the X1 core in the S888, with a difference of just 0.2%.
The major difference, however, is the average power consumed. For example, our biggest single test gain in 519.lbm is +41%, but where the S888 averages 4.49 watts, the new X2 core averages 7.62 watts. That’s a 70% increase in instantaneous power consumed, and realistically no single core in a modern smartphone should draw that much power. The power goes this high because lbm leverages the memory subsystem, especially that 6 MiB L3 cache, and relies on the 4 MiB system level cache, all of which consumes power. Overall in the lbm test, the +41% performance costs +20% energy, so efficiency is still +16% in this test. Some of the other tests, such as parest and blender, also follow this pattern.
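As a cross-check of the lbm figures, the power increase and the energy cost can be derived from the two wattages and the performance gain quoted above. This is only a sketch using the rounded numbers from the text; the constant and function names are hypothetical, and the per-test raw data is not reproduced here.

```python
S888_LBM_WATTS = 4.49   # average power of the X1 core in 519.lbm
S8G1_LBM_WATTS = 7.62   # average power of the X2 core in 519.lbm
LBM_PERF_RATIO = 1.41   # +41% performance, i.e. runtime shrinks by 1/1.41


def energy_ratio(power_ratio: float, perf_ratio: float) -> float:
    """Energy relative to baseline: power ratio times runtime ratio."""
    return power_ratio / perf_ratio


lbm_power_ratio = S8G1_LBM_WATTS / S888_LBM_WATTS
lbm_energy_ratio = energy_ratio(lbm_power_ratio, LBM_PERF_RATIO)

print(f"lbm average power: {lbm_power_ratio - 1:+.0%}")
print(f"lbm total energy:  {lbm_energy_ratio - 1:+.0%}")
```

The roughly 70% higher power draw is partly offset by the shorter runtime, which is why the energy cost comes out near +20% rather than +70%.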
Comparing against the competition, the X2 core does make a better generation jump when it comes to floating point performance. It will be interesting to see how other processors enable the X2 core, especially MTK’s flagship at a slightly higher frequency on TSMC N4. If it also has access to a full 14 MiB combination of caches, as we suspect, that could bring the power draw during single core use a lot higher. It will be difficult to tease out exactly who wins what where based on implementation vs. process node, but it will be a fun comparison to make when we look purely at the X2 vs. X2 cores.
Unfortunately due to how long SPEC takes to run (1h30 on the X2), we were unable to test on the A710/A510. We’ll have to wait to see when we get a retail unit.
Machine Learning: MLPerf and AI Benchmark 4
Although it is a relatively new benchmark in the space, MLPerf runs representative workloads on devices and takes advantage of both common ML frameworks such as NNAPI and the respective chip libraries for each vendor. Using this benchmark on retail phones to date, Qualcomm has had the lead in almost all the tests, but given that the company is promoting a 4x increase in AI performance, it will be interesting to see if that comes across all of MLPerf’s testing scenarios.
It should be noted that Apple’s CoreML is currently not supported, hence the lack of Apple numbers here.
Across the board in these first four tests Qualcomm is making a sizable lead, going above and beyond what the S888 can do. Here we’re seeing up to a 2.2x result, making an average +75% gain. It’s not quite the 4x that Qualcomm promoted in its materials, but there’s a sizable gap with the other high-end silicon we’ve tested to date.
The only non-lead is with the language processing, where Google’s Tensor SoC is almost 2x what the S8g1 scores. This test is based on a mobileBERT model, and either for software or architecture reasons, it fits a lot better into the Google chip than any other. As smartphones increase their ML capabilities, we might see some vendors optimizing for specific workloads over others, like Google has, or offering different accelerator blocks for different models. The ML space is also fast paced, so perhaps optimizing for one type of model might not be a great strategy long-term. We will see.
In AI Benchmark 4, running in pure NNAPI mode, the Qualcomm S8g1 takes a comfortable lead. Andrei noted in previous reviews that the power consumed during this test can be quite high, up to 14 W, and this is where some chips might be able to pull ahead with an efficiency advantage. Unfortunately we didn’t record power at the same time as the test, but it would be good to monitor this in the future.
System-Wide Testing and Gaming
For our system wide tests, we had time to go through Geekbench 5, PCMark, and GFXBench. For workload based testing, we see performance uplifts with the S8g1, and it is worth noting that here we tested PCMark with both performance mode on and off, which gave a +10% increase in the score – we’ve seen this before running PCMark on both Arm and x86 devices where turbo and favored cores can have large effects on scores. By contrast, GB5 scored the same.
In our PCMark tests, it's clear who the new ruler of the roost is.
On the graphics side, Qualcomm’s new number-less Adreno that is advertised as being ‘new from the ground up (but we won’t tell you how)’ again offers generational improvements for next year’s Android flagships. Qualcomm historically also offers better graphics performance per watt, but we’ll have to wait until we get the devices on hand to showcase that data. Overall, the gains in these tests show a large +50% performance jump over previous generation S888 graphics performance. In 2022, we'll have MediaTek’s flagship trying to aim for the same market but based on the Mali GPU, and graphics is an area in which Qualcomm has historically outpaced Mali designs quite easily. The only serious competitor in this space is Apple.
Concluding Remarks: Arm v9 for Android
When we move through significant revisions of Arm’s architecture, up to v8 and now v9, it’s important to note that the new features defined in the ISA do not always fundamentally improve performance – it’s up to the microarchitecture teams to build the cores to the ISA specifications, and the implementation teams to enable the core in silicon with frequency and power efficiency. Accomplishing that requires a good process node, design technology co-optimization, and then partners that can execute by building the best devices for that processor.
Qualcomm’s target with the Snapdragon 8 Gen 1 is very clearly the 2022 Android flagship smartphones. New cores, new graphics, enhanced machine learning capabilities, a step function in camera processing power, an integrated X65 modem, all built on Samsung’s 4nm process node technology. The flagship Android space is an area in which Qualcomm has been comfortable for a number of years; however, the increased thermals of last generation’s Snapdragon 888 gave a number of analysts in the space a bit of a squeaky bum moment.
It’s hard to tell immediately in our small test if that still remains the case. Samsung’s 4nm node has improvements beyond the previous generation 5nm design, however Qualcomm’s presentational numbers were above and beyond those that Samsung provided, perhaps indicating that additional improvements both in architecture and implementation have led to those performance numbers.
Our testing shows +19% floating point performance on the X2 core, which is almost the +20% that Qualcomm quotes, but only +8% in integer, which is often the most quoted. We’re seeing power efficiency improvements for sure on the X2 core, with an overall efficiency improvement of 17%, but peak power has also increased, in part because some of our tests make use of the additional cache in the system. Our machine learning tests are +75% over the previous generation, although not the 4x numbers that Qualcomm states – we need to do more work here on power efficiency testing however. On the gaming side, our 'first run' numbers showcase some explosive gains in GPU throughput.
Although we’ve only done a few tests here, I would be remiss if I didn’t mention the elephant in the room: MediaTek. In the last month MediaTek announced a return to the high-end with a flagship processor of its own, using the same 1+3+4 configuration with slightly higher frequencies, more cache, and built on TSMC’s N4 process. Implementation here will be the key metric I feel, so how MediaTek has been able to optimize for TSMC N4 vs Qualcomm on Samsung 4nm is going to be analyzed. I should point out here that a processor is more than just the CPU cores, as we’ll see Adreno vs Mali on graphics, the different machine learning approaches, but also how the two companies approach 5G and connectivity, which has been one of Qualcomm’s most prominent strengths to date.
We look forward to testing the Qualcomm S8g1 in more detail in the New Year, as well as how many of the main smartphone OEMs choose Qualcomm for their flagship devices.