Name: Improving The Exynos 9810 Galaxy S9: Part 2 - Catching Up With The Snapdragon
Author: Andrei Frumusanu

Original Link: https://www.anandtech.com/show/12620/improving-the-exynos-9810-galaxy-s9-part-2

Improving The Exynos 9810 Galaxy S9: Part 2 - Catching Up With The Snapdragon

by Andrei Frumusanu on April 20, 2018 9:00 AM EST

Following our review of the Galaxy S9 there’s been a lot of discussion about both the performance and battery life of the Exynos 9810 variants of the Galaxy S9. In the original review I had identified a few key issues with the platform which I deemed to be contributing the most to the phone’s poor characteristics. In a first piece following the review I made a few minor changes to the kernel which already seemed to benefit battery life in our web browsing test, while slightly changing the performance characteristics of the phone for the better.

In that previous article I noted that there’s a lot more to be done to further improve the performance of the phone and to optimise battery life. Especially on the performance side there was, in my opinion, very low-hanging fruit in terms of possible changes that would benefit the user experience.

Focusing on Performance

For this second part I set about trying to recover the best performance possible and to match the Snapdragon 845 variant of the Galaxy S9, while still keeping an eye on battery life.

Samsung Galaxy S9 (E9810) Kernel Comparison and Changelog

Version / Changes and Notes

Official Firmware, As Shipped:
- Stock setup and behavior
- Single Core M3 at 2704 MHz
- Dual Core M3 at 2314 MHz
- Quad Core M3 at 1794 MHz

Official Firmware, 'CPU Limited Mode':
- Optional Samsung-defined CPU Mode in Settings
- CPU limited to 1469 MHz
- Memory controller at half-speed
- Conservative Scheduler

Custom Config 1:
- Start with 'As Shipped' Firmware
- Remove hotplugging mechanism
- Limit M3 frequency peak to 1794MHz at any loading

Custom Config 2 (Kernel source):
- Raise little core frequency to 1950MHz
- Raise big core minimum frequency to 962MHz
- Adapt EAS cost tables based on measured perf & power
- Merge scheduler patches to 4.9-eas-dev (Up to Jan18)
- Backport PELT util_est and use it
- Backport PELT decay rate change to 16ms
- Adapt/disable no longer needed Samsung sched(util) mods
- Minor custom modifications for tuning

Custom Config 3:
- Raise big core frequency to 2314MHz & relevant adjustments

As a starting point we’re continuing on where we left off in part 1, which was extremely straightforward as the only changes were the removal of all boost frequencies above 1.8GHz on the M3 cores and disabling the online core / hotplugging driver.

In the original review the most evident issue that I identified in terms of badly affecting performance of the phone was the way the device was extremely slow in terms of scaling up in frequency, as well as migrating threads onto the big cores. The original values I described were around 410ms for a steady state continuous workload to actually reach the maximum frequency of the big cores. This was a great contrast to the 65ms of the Snapdragon 845 variant. Setting all other things aside this is what was limiting the interactive performance of the Exynos 9810 the most, so naturally it’s what we want to fix first and foremost.

Scheduling history around EAS

As a little backstory, ever since big.LITTLE’s introduction several years ago the biggest goal for ARM has been to have SoC vendors run the heterogeneous CPUs with a smart scheduler that would be aware of the various CPUs’ performance and energy characteristics. This was a fine goal to have but the road to get there has been, in my opinion, nothing short of a mess. ARM’s approach was to try to do the work in the upstream Linux kernel or within the Linaro workgroup kernel. Unfortunately, after years of delays, a lot of the hype around what energy aware scheduling (EAS) would bring ended in a fizzle when it came to shipping commercial devices. I think Qualcomm was on the ball here as early as 2015 with the Snapdragon 810, and we’ve covered extensively what the company was trying to do to resolve issues relating to EAS.

A key component to enabling scheduling across heterogeneous CPUs is the ability for the scheduler to actually know the activity and load of individual tasks, instead of only knowing the general CPU utilisation. If you know an individual task’s load, then you can make better scheduling decisions on which CPU cores to place it. This was originally implemented in the Linux kernel through the PELT (per-entity load tracking) mechanism and is what was used for migration decisions both in HMP and EAS scheduling.

Exynos 9810 Floor Plan. Image Credit TechInsights

Another long-running goal of Arm and the Linux community was to integrate CPU frequency selection logic within the scheduler, instead of it being a separate mechanism. This was first attempted in a project called schedfreq, and is now fully integrated into a new governor called schedutil. Again the implementation time-scale we’re talking about here was several years, while at the same time we’re seeing several device generations being shipped with a myriad of solutions.

S.LSI’s Exynos chipsets were playing it safe, and up to the Exynos 9810 the company just chose to stick to an HMP scheduler with a separate interactive CPU frequency governor. Huawei Kirin chipsets ship with EAS, however even with the latest devices such as the P20, the company forgoes the scheduler-driven CPU frequency governors and falls back to a traditional interactive one (with very good results). Meanwhile Qualcomm has advanced their custom implementation and taken another approach called WALT (Window-assisted load tracking) that is far more responsive than PELT. On the Snapdragon 835 and 845 this is the core mechanism that assures the best performance in terms of scheduling and CPU frequency selection.

Scheduler mechanisms: WALT & PELT

Over the years, it seems Arm noticed the slow progress and now appears to be working more closely with Google in developing the Android common kernel, utilizing out-of-tree (meaning outside of the official Linux kernel) modifications that benefit performance and battery life of mobile devices. Qualcomm also has been a great contributor as WALT is now integrated into the Android common kernel, and there’s a lot of work going on from these parties as well as other SoC manufacturers to advance the platform in a way that benefits commercial devices a lot more.

Samsung LSI’s situation here seems very puzzling. The Exynos 9810 is the first flagship SoC to actually make use of EAS, and they are basing the BSP (Board support package) kernel off of the Android common kernel. The issue here is that instead of choosing to optimise the SoC through WALT, they chose to fall back to fully PELT-dictated task utilisation. That’s still fine in terms of core migrations, however they also chose to use a very vanilla schedutil CPU frequency driver. This meant that the frequency ramp-up of the Exynos 9810 CPUs would have the same characteristics as PELT, and so would inherit one of the existing disadvantages of PELT: a relatively slow ramp-up.
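
To make the coupling between PELT and schedutil concrete, the following is a minimal Python sketch rather than kernel code. It assumes an idealised task that starts from zero utilisation and then runs continuously, treats each PELT period as 1ms (the kernel uses 1024µs), uses a 32ms half-life (the stock PELT value), and applies the 1.25x headroom that mainline schedutil adds when converting utilisation into a frequency request. The function names are made up for illustration, and the model ignores frequency and capacity invariance, rate limits, core migration and Samsung's own additions.

```python
def pelt_util(ms_running, half_life_ms=32.0, max_util=1024.0):
    """Utilisation of a task that has run continuously for ms_running ms,
    starting from zero. A contribution half_life_ms old counts for half."""
    decay_per_ms = 0.5 ** (1.0 / half_life_ms)
    return max_util * (1.0 - decay_per_ms ** ms_running)


def schedutil_freq(util, max_freq_khz, max_util=1024.0):
    """schedutil-style request: 1.25 * max_freq * util / max, clamped."""
    return min(max_freq_khz, 1.25 * max_freq_khz * util / max_util)


def ms_to_reach_max(max_freq_khz, half_life_ms=32.0):
    """Whole milliseconds of continuous running before the requested
    frequency hits max_freq_khz in this simplified model."""
    ms = 0
    while schedutil_freq(pelt_util(ms, half_life_ms), max_freq_khz) < max_freq_khz:
        ms += 1
    return ms
```

In this toy model `ms_to_reach_max(1794000)` comes out at 75, i.e. the utilisation signal alone needs about 75ms of continuous running before schedutil asks for the top frequency. The figure measured on the stock phone also includes effects the sketch leaves out, so the number is only meant to show that the utilisation signal by itself already imposes a delay of several tens of milliseconds.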

Source: BKK16-208: EAS

Source: WALT vs PELT : Redux – SFO17-307

One of the best resources on the issue actually comes from Qualcomm, as they had spearheaded the topic years ago. In the above presentation, given at Linaro Connect 2016 in Bangkok, we see the visual representation of the behaviour of PELT vs WinLT (which is what WALT was called at the time). The metrics to note here in the context of the Exynos 9810 are util_avg (which is the default behaviour on the Galaxy S9) and the contrast to WALT’s ravg.demand and actual task execution. So out of all the possible options in terms of BSP configurations, Samsung seemed to have chosen the worst one for performance. And I do think this was a conscious choice, as Samsung had added mechanisms to both the scheduler (eHMP) and schedutil (freqvar) to counteract this very slow behaviour caused by PELT.
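
The difference between the two tracking styles can be sketched in a few lines of Python. This is a simplified model in the spirit of a window-based tracker such as WALT, not Qualcomm's implementation: the class name, the five-window history and the "maximum of the most recent window and the history average" policy are assumptions chosen for illustration.

```python
from collections import deque


class WindowedDemand:
    """Hypothetical window-based demand tracker (illustrative only)."""

    def __init__(self, history=5):
        # Busy fraction (0.0 to 1.0) of the last `history` fixed-size windows.
        self.samples = deque(maxlen=history)

    def close_window(self, busy_fraction):
        """Record a finished window and return the new demand estimate."""
        self.samples.append(busy_fraction)
        recent = self.samples[-1]
        average = sum(self.samples) / len(self.samples)
        # Rise immediately with the latest window, fall slowly with the average.
        return max(recent, average)
```

A task that goes from idle to fully busy is reported at full demand as soon as its first busy window closes, whereas a geometrically decaying average like PELT's util_avg only approaches that value gradually.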

In trying to resolve this whole issue, instead of adding additional logic on top of everything I looked into fixing the issue at the source.

What was first tried is perhaps the most obvious route, and that's to enable WALT and see where that goes. While using WALT as a CPU utilisation signal for the Exynos S9 gave outstandingly good performance, it also very badly degraded battery life. I had a look at the Snapdragon 845 Galaxy S9’s scheduler, but here it seems Qualcomm diverges significantly from the Google common kernel which the Exynos is based on. This being far too much work to port, I had another look at the Pixel 2’s kernel – which luckily was a lot nearer to Samsung’s. I ported all relevant patches which were also applied to the Pixel 2 devices, along with porting EAS to a January state of the 4.9-eas-dev branch. This improved WALT’s behaviour while keeping performance, however there was still significant battery life degradation compared to the previous configuration. I didn’t want to spend more time on this so I looked through other avenues.

Source : LKML Estimate_Utilization (With UtilEst)

Looking through Arm's resources, it looks very much like the company is aware of the performance issues and is actively trying to improve the behaviour of PELT to more closely match that of WALT. One significant change is a new utilisation signal called util_est (Utilisation estimation) which is added on top of PELT and is meant to be used for CPU frequency selection. I backported the patch and immediately saw a significant improvement in responsiveness due to the higher CPU frequency state utilisation. Another simple way of improving PELT was reducing the ramp/decay timings, which incidentally also got an upstream patch very recently. I backported this to the kernel as well, and after testing an 8ms half-life setting for a bit and judging it to not be good for battery life, I settled on a 16ms setting, which is an improvement over the 32ms of the stock kernel and gives the best compromise between performance and battery life.

Because of these significant changes in the way the scheduler is fed utilisation statistics, the existing tuning from Samsung obviously wasn’t valid anymore. I adapted most of it as best I could, which basically involved just disabling most of it as it was no longer needed. I also significantly changed the EAS capacity and cost tables, as I do not think that the way Samsung populated the table is correct or representative of actual power usage, which is very unfortunate. Incidentally, this last bit was one of the reasons that performance changed when I limited the CPU frequency in part 1, as it shifted the whole capacity table and changed the scheduler heuristic.
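
The effect of the half-life and of a util_est-style signal can be illustrated with a small Python sketch. It is a continuous-time approximation of PELT's geometric decay, with hypothetical function names, and it reduces util_est to "remember the utilisation the task had when it last went to sleep"; the real patch also keeps a moving average, which is omitted here.

```python
def decay_factor(ms, half_life_ms):
    """Fraction of an old contribution that survives after ms milliseconds."""
    return 0.5 ** (ms / half_life_ms)


def run_for(util, ms, half_life_ms, max_util=1024.0):
    """Utilisation after the task runs continuously for ms milliseconds."""
    d = decay_factor(ms, half_life_ms)
    return util * d + max_util * (1.0 - d)


def sleep_for(util, ms, half_life_ms):
    """Utilisation after the task sleeps for ms milliseconds."""
    return util * decay_factor(ms, half_life_ms)


def estimated_util(util_avg, util_at_last_dequeue):
    """util_est-style signal: never below the value seen before sleeping."""
    return max(util_avg, util_at_last_dequeue)
```

With a 32ms half-life, a task that runs for 32ms from idle reaches a utilisation of 512 and falls back to 256 after sleeping for another 32ms; with a 16ms half-life the same task reaches 768. In both cases the util_est-style signal still reports the pre-sleep value when the task wakes up, so the frequency request does not have to ramp up from scratch.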

But of course, what most of you are here for is not how this was done but rather the hard data on the effects of my experimenting, so let's dive into the results.

Performance & Battery Results

System performance was the key concern for the Exynos 9810 Galaxy S9. Having addressed the biggest issues in terms of scheduler and DVFS scaling, it’s time to check how this impacts our benchmark results. Again I’d like to note that the following figures are not the best scores that the device achieved; they are the scores of the configuration that in my opinion best balanced performance and battery life given the time invested.

To recap what we’re looking at: the original firmware S9 (E9810) scores showcase the unmodified behaviour; the custom 1 modifications simply limit performance to 1794MHz on the M3 cores and disable the higher clock boost modes. Custom 2 contains the more major kernel modifications which we've covered on the first page while maintaining the 1794MHz maximum clocks.

The custom 3 configuration maintains all mechanisms but raises the clock back to 2314MHz, a clock which based on some investigation in the kernel source code may have been the original maximum designed frequency of the SoC, and incidentally the last frequency before major diminishing performance/power returns in terms of voltage scaling.

PCMark Work 2.0 - Web Browsing 2.0

In PCMark’s web browsing test 2.0 we see a major improvement from all of the custom configurations. The first variant’s performance boost was caused by the capacity scale shift towards the big cores and thus having more workloads migrated onto the M3s. Custom 2 & 3 retained similar performance; however, the performance boost here over the stock configuration comes from the increased scheduler and DVFS responsiveness. Raising clock speeds further to 2.3GHz didn’t bring much improvement: improving performance here sees diminishing returns, and making the cores more aggressive brings a larger power and battery degradation.

PCMark Work 2.0 - Video Editing

PCMark Work 2.0 - Writing 2.0

In part 1 I said I was confident of being able to quickly recover some of the performance degradations of the first custom kernel. The PELT changes alongside the tuning did bring us back above the stock kernel in the video and writing tests. It’s interesting to see here the difference that the increased 2.3GHz clock brings: everything else being equal, the writing test gets a big boost in the score. I suspect this is because of the PDF rendering part of the test, which is more compute intensive and just outright CPU throughput-bound.

PCMark Work 2.0 - Data Manipulation

The Data Manipulation test didn’t see major differences between custom 1 & 2. This test didn’t seem to be significantly influenced by the scheduler changes, and the score only scaled slowly with more aggressive and power-hungry modifications. Raising the clock back to 2.3GHz in turn gave us a good linear increase, pointing to more long-running tasks and CPU capacity limits.

PCMark Work 2.0 - Photo Editing 2.0

The Photo editing benchmark was the test most affected by the scheduler and DVFS changes. Here we finally find the explanation for the Exynos SoC’s bad performance in this test: the frequency just doesn’t scale fast enough with the heavy-but-short-lived tasks. This test scaled up to scores of 13000 at the most aggressive settings with WALT, but again at far too great power costs for it to be reasonable.

Source: WALT vs PELT : Redux – SFO17-307

Again, Qualcomm seems to be well aware of this, as they use PCMark as a demonstration of their custom scheduler modifications. I’m aware of concerns that this may not be representative of real-world uses and that there might be too much of a focus on PCMark, but in my experience this is not the case and I still think it’s overall the best representation we have of device “snappiness”.

Indeed, both configurations with the scheduler modifications made the S9 significantly more responsive and put it among the most responsive devices, if not at the very top. There’s a big “but” consideration for real world usage though: I haven’t tuned the touch boosters of the phone. These were naturally tuned for the slower stock behaviour of the SoC. In general the WALT and PELT modifications have the end-goal of completely avoiding the need for such input boosting mechanisms in order to save on power. Therefore my subjective experience of the phone being so fast might be just a temporary thing, and for battery optimisation we might tone it down a bit through the removal of the input boosters. Unfortunately, short of having a robotic arm and special benchmarks, this is all nearly impossible to test objectively.

Edit: I ended up trying a kernel with the input boosters turned off completely and the phone still performs very well.

Another point I want to clarify when I talk about device responsiveness, is that I generally use phones without animations enabled as it just provides for a much faster UI experience. Under this use-case it’s very noticeable to see performance differences between devices, as the default animation durations can hide a lot of hiccups.

Speedometer 2.0 - OS WebView

In terms of web benchmarks we start with Speedometer 2.0. This first test saw a smaller 7% boost from the new scheduler settings, so generally the performance here wasn’t an issue of responsiveness but rather of raw CPU capacity. Raising the clock back to 2.3GHz brings the performance back up near the Snapdragon 845 levels.

WebXPRT 3 - OS WebView

WebXPRT was also extremely sensitive to the scheduler settings, as we see a 10% increase on the custom 2 configuration. This was also one of the use-cases where the new PELT behaviour actually outperformed WALT, as I was only able to reach a score of 76 with the latter. Again the score here with the custom 2 configuration seems to be the highest reasonably-reachable performance under 1794MHz, and anything above that seems to be CPU capacity limited. The 2.3GHz configuration brought the performance back to Snapdragon 845 levels.

I also want to add that the performance levels in the web benchmarks here are at the highest achievable for the Exynos, and I don’t think any more scheduler or software optimisations will bring much. Especially as locking the cores to their maximum frequency doesn’t improve performance by much more.

Battery Life

The balance between performance and battery life is again a key aspect and issue of the phone and the SoC. I’ve tried myriads of configurations but simply wasn’t able to exceed the battery results of the first custom configuration. Unfortunately the physics here just overrule any possible software optimisation, and any performance increase will mean that we’re doing computations at higher performance states of the SoC, and thus we’ll be burning more energy per unit of work.
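
The "physics" argument can be written down with the usual first-order model of CMOS dynamic power, P = C·V²·f. Since the work done scales roughly with f, the energy per unit of work scales with V². The Python sketch below encodes only that relation; the function name is hypothetical, the voltages in the usage note are arbitrary ratios rather than Exynos 9810 values, and leakage is ignored.

```python
def relative_energy_per_cycle(voltage, reference_voltage):
    """Dynamic energy per clock cycle relative to a reference operating point.

    Dynamic power is modelled as C * V^2 * f, so energy per cycle is C * V^2
    and the capacitance C cancels out of the ratio.
    """
    if voltage <= 0 or reference_voltage <= 0:
        raise ValueError("voltages must be positive")
    return (voltage / reference_voltage) ** 2
```

For example, `relative_energy_per_cycle(1.1, 1.0)` is about 1.21: a frequency state that needs 10% more voltage costs roughly 21% more dynamic energy for the same amount of work, however well the scheduler is tuned.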

Web Browsing Battery Life 2016 (WiFi)

The custom 2 configuration brought a performance improvement, however this came at a cost of some battery life. This was a reasonable compromise and in my opinion the best-case scenario for the Exynos 9810 S9. Boosting the cores to 2.3GHz brought the 9810 to parity with the Snapdragon 845 in nearly all benchmarks, but this regressed battery life even further.

The juxtaposition between the stock results of the S9 versus the custom 3 configuration is interesting: we’re achieving similar battery life in both scenarios, just short of 7 hours. However the custom 3 configuration brings a 25-40% performance boost across a variety of tests. I might be beating a dead horse here, but again it shows that the Exynos 9810 could achieve much better results depending on which direction one opted to optimise for, either performance or battery life.

PCMark Work 2.0 - Battery Life

I didn’t cover PCMark battery life in the first part but I wanted to make sure I cover all my bases for this piece. Here both custom 2 and 3 saw battery life regress versus the stock behaviour, but again in both cases that’s simply due to the higher overall performance. The delta from the increased 2.3GHz clock rate is a lot smaller here than in the web test, and that’s due to PCMark being a lot more interactive rather than consisting of continuous workloads. It has fewer really heavy tasks such as loading webpages, which use the higher frequency states of the M3s for relatively longer times.

Conclusion

With these pieces I wanted to see what’s possible with the Exynos 9810. There’s definitely still room for improvement; I’m still sure a properly tuned WALT configuration like on the Snapdragon 845 S9 or the Pixel 2 would further improve the performance or battery life of the Exynos S9. I didn’t want to go down that rabbit hole for a custom kernel; for now the improved PELT changes are about as good as it reasonably gets.

One thing I did discover is the performance discrepancy between the M3 and Kryo 385 when it comes to synthetic benchmarks versus some of the web benchmarks. While 1794 MHz is enough to match the A75-based CPU cores of the Snapdragon in GeekBench or SPEC, I wasn’t able to match the higher performance in the web benchmarks unless I raised the clocks to around 2.3GHz. I can now dismiss software as being the main culprit here, and instead there are more fingers pointing at the micro-architecture of the M3. This has some relatively big repercussions as it raises the question of what kind of workload is actually more representative of overall Android smartphone use-cases.

The above graphic is my best guess on what the performance/power curves look like. These are based on scheduler cost tables, voltage curves and correlations to actual measured power at certain points. The big question here is what the actual representative positioning between the two architectures is in terms of performance. As we saw in part 1, the M3 can win on average in workloads such as SPEC at the same performance points as the S845. However to reach the higher performance of the 845 in web workloads we need to raise the clocks, and this of course would shift the efficiency curves much further in favour of the Arm cores. The average is probably somewhere in-between, and Arm and Samsung hopefully have a more complete view in terms of workload characterization.

What is indisputable is that the M3 lags behind in the lower frequency states. Here, Samsung’s cores just stop scaling further down in voltage after 1170MHz, while the Snapdragon and Arm cores' power curves are just a lot steeper. Again the absolute difference is arguable depending on workloads, be it 25% or 100%. Unfortunately at this point we’re talking about insurmountable physics and there’s just no software optimisation which will overcome this.

In the end the Exynos S9 was hampered on two fronts. The first is a very unoptimised BSP (Board support package; kernel, drivers, etc) by S.LSI (with the Mobile Division also possibly being a factor), particularly the seemingly senseless chasing of higher synthetic benchmark scores such as GeekBench, which in turn backfired very badly in any real-world workloads. Qualcomm provided Samsung with an excellent baseline BSP on the S845 S9s, so for S.LSI not being able to do the same is just unfortunate. The other front where the Exynos S9 was hampered was that the M3 just seems oversized and power hungry, and it can’t sufficiently act as the efficient workhorse for general workloads. Compounding problems, this comes at a cost of battery life. Here there’s just a lot more to be done to fix the efficiency and the performance discrepancy relative to Arm’s cores.