Original Link: https://www.anandtech.com/show/13318/huawei-benchmark-cheating-headache



Section By Ian Cutress & Andrei Frumusanu

Does anyone remember our articles regarding unscrupulous benchmark behavior back in 2013? At the time we called the industry out on the fact that most vendors were increasing thermal and power limits to boost their scores in common benchmark software. Fast forward to 2018, and it is happening again.

Benchmarking Bananas: A Recap

cheat: verb, to act dishonestly or unfairly in order to gain an advantage.

AnandTech has a long and rich history of exposing benchmark cheating on smartphones. It is quite apt that this story comes full circle, as the person who tipped off Brian about Samsung’s cheating behaviour on the Exynos Galaxy S4 a few years back was Andrei, who now writes for us.

When we exposed one vendor, it led to a cascade of discussions, a few more articles investigating other vendors involved in the practice, and eventually Futuremark delisting several devices from its benchmark database. Scandal was high on the agenda, and the results were bad for both the companies and end users: devices found cheating tarnished the brand, and consumers could no longer take any benchmark data from that company as valid. Even reviewers were misled. It was a deep rabbit hole that should never have been approached – how could a reviewer or customer trust the number coming out of a phone if it was not running in a standard ‘mode’?

So thankfully, ever since then, vendors have backed off from the practice quite a bit. For several years after 2013, the significant majority of devices on the market appeared to behave within expected parameters. There are some minor exceptions, mostly from Chinese vendors, although these come in several flavors. Meizu takes a reasonable approach: when a benchmark is launched, the device puts up a prompt asking the user to confirm entering a benchmark power mode, so at least the company is open and transparent about it. Some other phones have ‘Game Modes’ as well, which focus either on raw performance or on extended battery life.

Going Full Circle, At Scale

So today we are publishing two front page pieces. This one is a sister article to our piece addressing Huawei’s new GPU Turbo; while the marketing claims around that technology are overzealous, the technology itself is sound. Through the testing for that article, we stumbled upon this issue – completely orthogonal to GPU Turbo – which needed to be published. We also wanted to address something that Andrei came across while spending more time with this year’s devices, including the newly released Honor Play.

The Short Detail

As part of our phone comparison analysis, we often employ additional power and performance testing alongside our benchmarks. While testing the new phones, we found that the Honor Play produced some odd results. Compared to the Huawei P20 devices tested earlier in the year, which use the same SoC, the results were quite a bit worse, and equally strange.

Within our P20 review, we had noted that the P20’s performance had regressed compared to the Mate 10. Since we had encountered similar issues on the Mate 10, which were resolved with a firmware update pushed to us, we didn’t dwell too much on the topic and concentrated on other parts of the review.

Looking back at it now after some re-testing, it seems quite blatant what Huawei, and seemingly Honor, have been doing: the newer devices come with a benchmark detection mechanism that enables a much higher power limit for the SoC, with far more generous thermal headroom. Ultimately, in certain whitelisted applications, the device performs far above what a user would see in similar, non-whitelisted titles. This consumes more power, pushes the efficiency of the unit down, and reduces battery life.
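To illustrate the general shape of such a mechanism, here is a purely hypothetical sketch in Python. Huawei’s actual implementation lives in firmware and is not public; the package names, power limits, and function names below are our own illustrative assumptions, not anything extracted from a device.

```python
# Purely illustrative sketch of a whitelist-based power-limit switch.
# This is NOT Huawei's code; package names and limit values are invented.

# Hypothetical whitelist of benchmark package names
BENCHMARK_WHITELIST = {
    "com.glbenchmark.glbenchmark27",          # e.g. a public GFXBench build
    "com.futuremark.dmandroid.application",   # e.g. a public 3DMark build
}

# Hypothetical power/thermal profiles (values invented for the example)
DEFAULT_PROFILE = {"soc_power_limit_w": 4.0, "skin_temp_limit_c": 38}
BENCHMARK_PROFILE = {"soc_power_limit_w": 8.5, "skin_temp_limit_c": 48}

def select_power_profile(foreground_package: str) -> dict:
    """Return a raised power/thermal budget only for whitelisted benchmarks."""
    if foreground_package in BENCHMARK_WHITELIST:
        return BENCHMARK_PROFILE
    return DEFAULT_PROFILE

# A renamed, custom benchmark build would not match the whitelist,
# which is exactly why our internal test versions expose the difference.
print(select_power_profile("com.glbenchmark.glbenchmark27"))   # boosted profile
print(select_power_profile("com.example.custom_gfxbench"))     # default profile
```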

This has knock-on effects, such as trust in how the device works. The end result is that a single performance number is higher, which is good for marketing, but unrealistic for anyone actually using the device. The efficiency of the SoC also decreases (depending on the chip), as the chip is pushed well outside its standard operating window. It makes the SoC, one of the differentiating points of the device, look worse, all for the sake of a high benchmark score. Here's an example of benchmark detection mode on and off on the Honor Play:

GFXBench T-Rex Offscreen Power Efficiency (Total Device Power)

| AnandTech | Mfc. Process | FPS | Avg. Power (W) | Perf/W Efficiency |
|---|---|---|---|---|
| Honor Play (Kirin 970) BM Detection Off | 10FF | 66.54 | 4.39 | 15.17 fps/W |
| Honor Play (Kirin 970) BM Detection On | 10FF | 127.36 | 8.57 | 14.86 fps/W |
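For clarity, the efficiency column is simply average frames per second divided by average total device power. A quick check against the table above (a trivial sketch, using the figures as published):

```python
# Perf/W is average FPS divided by average total device power (W).
fps_off, power_off = 66.54, 4.39    # benchmark detection off (true performance)
fps_on,  power_on  = 127.36, 8.57   # benchmark detection on (whitelisted/boosted)

print(f"Detection off: {fps_off / power_off:.2f} fps/W")  # ~15.2 fps/W
print(f"Detection on:  {fps_on / power_on:.2f} fps/W")    # ~14.9 fps/W
```

The boosted mode nearly doubles the frame rate, but it does so by nearly doubling power draw, so efficiency actually drops slightly.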

We’ll go more into the benchmark data on the next page.

We did approach Huawei about this during the IFA show last week, and obtained a few comments worth putting here. Another element to the story is that Huawei’s new benchmark behavior very much exceeds anything we’ve seen in the past. We use custom editions of our benchmarks (from their respective developers) so we can test with this ‘detection’ both on and off, and the massive differences in performance between the publicly available benchmarks and the internal versions we use for testing are absolutely astonishing.

Huawei’s Response

As usual with investigations like this, we offered Huawei an opportunity to respond. We met with Dr. Wang Chenglu, President of Software at Huawei’s Consumer Business Group, at IFA to discuss this issue, which is purely a software play from Huawei. We covered a number of topics in a non-interview format, which are summarized here.

Dr. Wang asked whether these benchmarks are the best way to test smartphones as a whole, as he personally feels they are moving away from real-world use. A single benchmark number, stated Huawei’s team, does not show the full experience. We also discussed the validity of the current set of benchmarks, and the need for standardized benchmarks. Dr. Wang expressed his preference for a standardized benchmark that is more representative of the user experience, and said Huawei wants to be a part of any movement towards such a benchmark.

I explained that we work with these benchmark companies, such as Kishonti (GFXBench) and Futuremark (3DMark), as well as others, to help steer their tests in a direction that better represents real use. We also explained that employing a benchmarking mode to game test results is not a solution to what Huawei sees as a misrepresentation of the user experience by these benchmarks. This is especially true when the chip ends up with lower efficiency – to be honest with the test, the only way for it to better relate to the user experience is to run it in the standard power envelope that every regular game runs in.

Huawei stated that they have been working with industry partners for over a year to find the tests closest to the user experience. They like the fact that for items like call quality there are standardized, industry-recognized real-world tests that measure these features, and every company works towards a better objective result. But in the same breath, Dr. Wang also expressed that, in relation to gaming benchmarks, ‘others do the same testing, get high scores, and Huawei cannot stay silent’.

He stated that the situation is much better than it used to be, and that Huawei ‘wants to come together with others in China to find the best verification benchmark for user experience’. He also stated that ‘in the Android ecosystem, other manufacturers also mislead with their numbers’, citing one specific popular smartphone manufacturer in China as the biggest culprit, and said it is becoming ‘common practice in China’. Huawei wants to open up to consumers, but has trouble when competitors continually post unrealistic scores.

Ultimately Huawei states that they are trying to face off against their major Chinese competition, which they say is difficult when other vendors put their best ‘unrealistic’ scores first. They feel that the way forward is standardization of benchmarks, so that there is a level playing field, and they want the media to help with that. But in the interim, we can see that Huawei has been putting its own unrealistic scores first too.

Our response to this is that Huawei needs to be a leader, not a follower, on this issue. I explained that the benchmarks we use (such as GFXBench) are well understood, ‘standard’, and as close to the real world as possible, while there are benchmarks we don’t use (such as AnTuTu) because they don’t mean anything. We also use benchmarks such as SPEC, which are very much standard in this space, to evaluate an SoC and device.

The discussion then pivoted towards the decline in trust in Huawei’s benchmark numbers in presentations as a result of this. We already take such data with a large grain of salt, but now we have no reason to listen to it at all, as we do not know which values come from this ‘benchmark’ mode.

Huawei’s reaction to this is that they will ensure that future benchmark data in presentations is independently verified by third parties at the time of the announcement. This was the best bit of news.

Our Reaction

While it was never stated explicitly in a single clear line, Huawei is admitting to doing what they are doing, and citing specific vendors in China as the primary reason for it.

We understand the appeal of higher marketing numbers; however, this is the worst way to get them – rather than calling out the competition for bad practices, Huawei is trying to beat them at their own game, and it’s a game in which everyone loses. For a company the size of Huawei, brand image is a big part of what the company is, and trying to mislead customers just for a high score will backfire. It has backfired.

Huawei’s comments about standardized benchmarking are not new – we’ve heard the same since time immemorial in the PC space, and several years ago Arm was similarly discussing it with the media. Since then the situation has improved: the canned benchmark companies speak with game developers to develop real-world scenarios, though they also want to push the boundaries.

The only thing that hasn’t happened in the mobile space, compared to the PC space, is proper in-game benchmark modes that output data properly. This is something that is going to have to be vendor driven, as our interactions with big gaming studios on in-game benchmarks typically fall flat. Any frame rate testing on mobile requires additional software, which can require root; however, Huawei recently disabled the ability to root its phones. We’re told that at some point in the future Huawei will re-enable rooting for registered developers.

Overall, while it’s positive that Huawei is essentially admitting to these tactics, we believe the reasons for doing so are flimsy at best. The best way to implement this sort of ‘mode’ is to make it optional, rather than automatic, as some vendors in China already do. But Huawei needs to lead from the front if it ever wants to approach Samsung in unit sales.

Huawei did not go into how the benchmarking detection will be addressed in current and future devices. We will come back to the issue for the Mate 20 launch on October 16th.



The Raw Benchmark Numbers

Section By Andrei Frumusanu

Before we go into more detail, we're going to have a look at just how much of a difference this behavior makes to benchmark scores. The key is in the differences between having Huawei/Honor's benchmark detection mode on and off. We are using our mobile GPU test suite, which consists of Futuremark’s 3DMark and Kishonti’s GFXBench.

The analysis right now is limited to the P20s and the new Honor Play, as I don’t yet have newer stock firmware on my Mate 10s. It is likely that the Mate 10 will exhibit similar behaviour – Ian also confirmed that he's seeing the cheating behaviour on his Honor 10. This points to most (if not all) Kirin 970 devices released this year as being affected.

Without further ado, here are some of the differences identified between running the same benchmarks while being detected by the firmware (Cheating) and the default performance that applies to any non-whitelisted application (True Performance). The non-whitelisted application is a version provided to us by the benchmark developer which is undetectable by the firmware and not publicly available (otherwise it would be easy to spot).

3DMark Sling Shot 3.1 Extreme Unlimited - Graphics - Peak 

3DMark Sling Shot 3.1 Extreme Unlimited - Physics - Peak 

GFXBench Aztec High Off-screen VK - Peak 

GFXBench Aztec Normal Off-screen VK - Peak 

GFXBench Manhattan 3.1 Off-screen - Peak 

GFXBench T-Rex Off-screen - Peak

We see a stark difference between the resulting scores – with our internal versions of the benchmarks performing significantly worse than the publicly available versions. We can also see that all three smartphones perform almost identically in the higher power mode, as they all share the same SoC. This contrasts significantly with the real performance of the phones, which is anything but identical, as the three phones have different thermal limits as a result of their different chassis/cooling designs. Consequently, the P20 Pro, being the largest and most expensive, has better thermals in the 'regular' benchmarking mode.

Raising Power and Thermal Limits

What is happening here with Huawei is a bit unusual in terms of how we’re used to seeing vendors cheat in benchmarks. In the past we’ve seen vendors actually raise the SoC frequencies, or lock them at their maximum states, raising performance beyond what’s usually available to generic applications.

What Huawei is doing instead is boosting benchmark scores by coming at it from the other direction – the benchmarking applications are the only use-cases where the SoC actually performs to its advertised speeds. Meanwhile, every other real-world application is throttled to a significant degree below that state due to the thermal limitations of the hardware. What we end up seeing with unthrottled performance is perhaps the 'true' form of an unconstrained SoC, although this is completely academic when compared to what users actually experience.

To demonstrate the power behaviour between the two different throttling modes, I measured the power on the newest Honor Play. Here I’m showcasing total device power at fixed screen brightness; for GFXBench the 3D phase of the benchmark is measured for power, while for 3DMark I’m including the totality of the benchmark run from start to finish (because it has different phases).

Honor Play Device Power - Default vs Cheating

The differences here are astounding, as we see that in the 'true performance' state, the chip is already reaching 3.5-4.4W. These are the kind of power figures you would want a smartphone to limit itself to in 3D workloads. By contrast, using the 'cheating' variants of the benchmarks completely explodes the power budget. We see power figures above 6W, and T-Rex reaching an insane 8.5W. On a 3D battery test, these figures very quickly trigger an 'overheating' notification on the device, showing that the thermal limits must be beyond what the software is expecting.

This means that the 'true performance' figures aren’t actually stable – they strongly depend on the device’s temperature (which is typical for most phones). Huawei/Honor are not actually blocking the GPU from reaching its peak frequency state: instead, the default behavior is a very harsh thermal throttling mechanism that tries to maintain significantly lower SoC temperatures and overall power consumption.
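As a rough illustration of how a temperature-driven throttle of this kind behaves, here is a minimal sketch; the frequency table, temperature thresholds, and step logic are invented for the example and are not Huawei’s actual values or algorithm.

```python
# Illustrative temperature-driven GPU throttle. The OPP table, thresholds, and
# step sizes are invented for this example; they are not Huawei's values.

GPU_FREQS_MHZ = [300, 400, 500, 600, 700, 746]  # hypothetical frequency table

def throttle_step(current_idx: int, soc_temp_c: float, temp_target_c: float) -> int:
    """Drop one frequency step when over the target, recover one when well under."""
    if soc_temp_c > temp_target_c and current_idx > 0:
        return current_idx - 1          # too hot: back off a step
    if soc_temp_c < temp_target_c - 5 and current_idx < len(GPU_FREQS_MHZ) - 1:
        return current_idx + 1          # comfortably cool: claw back a step
    return current_idx

# A strict temperature target (default apps) converges to a low frequency,
# while a relaxed target (whitelisted benchmarks) lets the GPU stay near its peak.
idx = len(GPU_FREQS_MHZ) - 1
for temp in [45, 50, 55, 60, 62, 63]:               # hypothetical warming curve
    idx = throttle_step(idx, temp, temp_target_c=50)  # strict default-mode target
print("default-mode steady state:", GPU_FREQS_MHZ[idx], "MHz")
```

The point of the sketch is simply that the peak frequency state is never blocked outright; it just becomes unreachable in practice once the strict temperature target is hit.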

The net result is that in the phones' normal mode, peak power consumption during these tests can initially reach the same figures posted by the unthrottled variants, but the numbers very quickly fall back in a drastic manner. Here the device throttles down to 2.2W in some cases, reducing performance quite a lot.



Mea Culpa: It Should Have Been Caught Earlier

Section By Andrei Frumusanu

As stated on the previous page, I had initially seen the effects of this behaviour back in January, when I was reviewing the Kirin 970 in the Mate 10. The numbers I originally obtained showed worse-than-expected performance from the Mate 10, which was being beaten by the Mate 9. When we discussed the issue with Huawei, they attributed it to a firmware bug and pushed me a newer build that resolved the performance issues. At the time, Huawei never discussed what that 'bug' was, and I didn't push the issue, as performance bugs do happen.

For the Kirin 970 SoC review, I went through my testing and published the article. Later on, in the P20 reviews, I observed the same lower performance again. As Huawei had previously told me it was a firmware issue, I again attributed the poor performance to a similar problem, and expected Huawei to 'fix' the P20 in due course.

Looking back with hindsight, it is pretty obvious that there has been some less than honest communication from Huawei. The newly detected performance issues were not actually issues – they were the real representation of the SoC's performance. As the results were somewhat lower, and Huawei was insisting that the chips were highly competitive, I never suspected that these numbers were the genuine ones.

It's worth noting here that I naturally test with our custom benchmark versions, as they enable us to extract more data from the tests than just a simple FPS value. It never crossed my mind to test the public versions of the benchmarks to check for any discrepancy in behaviour. Suffice it to say, this will change in our testing going forward, with numbers verified on both versions.

Analyzing the New Competitive Landscape

With all that being said, our past published results for Kirin 970 devices were mostly correct, as we had used a variant of the benchmark that wasn’t detected by Huawei’s firmware. There is one exception, however: we weren't using a custom version of 3DMark at the time. I’ve now re-tested 3DMark, and updated the corresponding figures in past reviews to reflect the correct peak and sustained performance numbers.

As far as I could tell in my testing, the cheating behaviour has only been introduced in this year’s devices. Phones such as the Mate 9 and P10 were not affected. If I’m to be more precise, it seems that only EMUI 8.0 and newer devices are affected. Based on our discussions with Huawei, we were told that this was purely a software implementation, which also corroborates our findings.

Here is the competitive landscape across our whole mobile GPU performance suite, with updated figures where applicable. We are also including new figures for the Honor Play, and the new introduction of the GFXBench 5.0 Aztec tests across all of our recent devices:

3DMark Sling Shot 3.1 Extreme Unlimited - Graphics 

3DMark Sling Shot 3.1 Extreme Unlimited - Physics 

GFXBench Aztec Ruins - High - Vulkan/Metal - Off-screen 

GFXBench Aztec Ruins - Normal - Vulkan/Metal - Off-screen 

GFXBench Manhattan 3.1 Off-screen 

GFXBench T-Rex 2.7 Off-screen

Overall, the graphs are very much self-explanatory. The Kirin 960 and Kirin 970 are lacking in both performance and efficiency compared to almost every device in our small test here. This is something Huawei is hoping to address with the Kirin 980 and features such as GPU Turbo.



The Reality of Silicon And Market Pressure

Section By Andrei Frumusanu

In a sense, the Kirin 960 and Kirin 970 have been a welcome addition to our mobile testing suite. As a result of having devices powered by the two chipsets, we have switched over to a new testing methodology where we now always publish peak and sustained performance figures alongside each other. Without the behavior of these devices, we might never have changed our methods to catch these shenanigans.

But if we’re to go back to a paragraph in the Kirin 970 SoC piece:

Indeed, the vast discrepancies between the Kirin 960 and 970’s peak performance and their inability to sustain that performance was one of the key reasons why I opted to change our mobile GPU performance testing methodology this year. All reviews this year were published with peak and sustained performance figures alongside each other, trying to unveil some of the more negative aspects of sustained performance among some of today’s smartphones.

The behaviour of this year’s Kirin 970 devices is, in a sense, not surprising. Huawei and Honor's power throttling adjustments are a great positive for the actual user experience, as they solve one of the key issues I had brought up about the chips in the review: they limit phone power consumption to reasonable levels, rather than burning through power and battery capacity like crazy. This new throttling behavior is essentially an aftershock of the Kirin 960’s awful GPU power characteristics. Somebody smart at Huawei decided that the high power draw was indeed not good, and introduced a new, strict throttling mechanism to keep temperatures in check.

This means that when we look at the efficiency table, it makes a lot of sense. Both chips showcase instantaneous power draws way above the sustainable levels for their form-factors, which the throttling mechanism keeps in check.

Competing Against Cheaters: Two Options

While I fully support Huawei in introducing the new throttling mechanisms, the big faux pas here was excluding benchmark applications from them via a whitelist. Back in the Kirin 950 days, when we talked to HiSilicon’s managers, GPU power was already an important topic of discussion. That generation of chipsets had substantially lower GPU performance compared to the competition, however GPU power was always within the sustainable thermal envelope of the phones – around 3.5W.

Now, when we look at total system power, we see that Huawei has made improvements:

GFXBench Manhattan 3.1 Offscreen Power Efficiency (System Active Power)

| AnandTech | Mfc. Process | FPS | Avg. Power (W) | Perf/W Efficiency |
|---|---|---|---|---|
| Galaxy S9+ (Snapdragon 845) | 10LPP | 61.16 | 5.01 | 11.99 fps/W |
| Galaxy S9 (Exynos 9810) | 10LPP | 46.04 | 4.08 | 11.28 fps/W |
| Galaxy S8 (Snapdragon 835) | 10LPE | 38.90 | 3.79 | 10.26 fps/W |
| LeEco Le Pro3 (Snapdragon 821) | 14LPP | 33.04 | 4.18 | 7.90 fps/W |
| Galaxy S7 (Snapdragon 820) | 14LPP | 30.98 | 3.98 | 7.78 fps/W |
| Huawei Mate 10 (Kirin 970) | 10FF | 37.66 | 6.33 | 5.94 fps/W |
| Galaxy S8 (Exynos 8895) | 10LPE | 42.49 | 7.35 | 5.78 fps/W |
| Galaxy S7 (Exynos 8890) | 14LPP | 29.41 | 5.95 | 4.94 fps/W |
| Meizu PRO 5 (Exynos 7420) | 14LPE | 14.45 | 3.47 | 4.16 fps/W |
| Nexus 6P (Snapdragon 810 v2.1) | 20Soc | 21.94 | 5.44 | 4.03 fps/W |
| Huawei Mate 8 (Kirin 950) | 16FF+ | 10.37 | 2.75 | 3.77 fps/W |
| Huawei Mate 9 (Kirin 960) | 16FFC | 32.49 | 8.63 | 3.77 fps/W |
| Huawei P9 (Kirin 955) | 16FF+ | 10.59 | 2.98 | 3.55 fps/W |

The Kirin 960’s GPU power and inefficiency were a direct response to market pressure, as well as to negative user feedback regarding GPU performance. I don’t really blame Huawei; I highly praised the Mate 8 with its Kirin 950, because irrespective of the lower GPU performance, it was an excellent device whose thermals and sustained performance were outstanding. Despite this, the very first comment on that review was a 'despite the GPU …'. The average user will just look at the benchmarks, see that a phone is ranked lower, and not think any further. It also shows that companies do care what users want, and do listen to requests, but might react in a way users were not expecting.

Unfortunately the only way we can avoid this situation of a perceived performance deficit as a whole is if we as journalists, and companies like Huawei, educate users better. It also helps if device vendors have a more steadfast philosophy about remaining within reasonable power budgets.

Huawei and Its Future

Last Friday Huawei’s CEO announced the new Kirin 980, which is set to be the centerpiece in the Mate 20 lineup coming soon. The big messaging for this new chip is that it is on a new 7nm manufacturing node, and the biggest improvements have been on the GPU side. Huawei has promised power efficiency increases of a staggering 178%. If the math checks out and Kirin 980 devices indeed deliver these figures, then it would mean the company would finally get back to sustainable ~3.5W for GPU workloads, and simultaneously be competitive to some degree.
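As a back-of-the-envelope check, here is a short sketch using the Kirin 970 Mate 10 Manhattan 3.1 figures from the table above; the assumption that a ‘178% increase’ means roughly 2.78× the fps/W, and the 3.5W target, are our own.

```python
# Rough sanity check of the claimed +178% GPU power efficiency, based on our
# Mate 10 (Kirin 970) Manhattan 3.1 measurement of 37.66 fps at 6.33 W.
# Treating "+178%" as ~2.78x fps/W is an assumption on our part.

kirin970_fps, kirin970_w = 37.66, 6.33
kirin970_eff = kirin970_fps / kirin970_w            # ~5.95 fps/W

kirin980_eff = kirin970_eff * 2.78                  # ~16.5 fps/W, if the claim holds
target_power_w = 3.5                                # a sustainable GPU power budget
print(f"Implied Kirin 980 performance at {target_power_w} W: "
      f"{kirin980_eff * target_power_w:.1f} fps")   # ~57.9 fps
```

If that rough math holds, a Kirin 980 device could sit in the same ballpark as the Snapdragon 845's ~61 fps in this test while staying within a sustainable power budget, which is consistent with the 'competitive to some degree' framing above.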

I’ve already seen a lot of users dismiss the GPU performance of the new SoC. It seemingly, as admitted by Huawei, doesn’t beat the peak performance of the Snapdragon 845, the Qualcomm flagship announced last year. Yet this doesn’t matter, because the efficiency should be better for the new SoC. Because of this, real world sustained performance would be better as well, even if the peak figures don’t quite compete.

Here the only thing I can do is reiterate the balance between performance and efficiency as much as I can, in the hope of shifting more people away from the narrative of only looking at peak performance. I’m quite happy with our new GPU testing methodology because, frankly, it works – our sustained performance numbers were mostly unaffected by the cheating behaviour. I see the sustained scores as a good showcase of performance and efficiency across all devices.

The Honor Play: A Gaming Phone, or Just More Marketing?

Returning to square one: one of the reasons we’ve been analysing Huawei and Honor's phones in this level of detail again is that we've been trying to determine what exactly GPU Turbo is. We've addressed that technology in a separate article, and found that it does have technical merit. Here Huawei has tried to compensate for its hardware disadvantages by innovating through software. However, software can only do so much, and Huawei exaggerates the benefits of the new technology on devices like the Honor Play.

Unfortunately I see the reasons for the overzealous marketing of GPU Turbo and the cheating behaviour described in this article as one and the same: the current SoCs are far behind in graphics performance and efficiency. The reality is that Qualcomm’s GPU architecture currently has a major efficiency advantage, which allows it to reach far higher performance figures.

So Honor is trying to promote the Honor Play as a gaming-centric phone, making bold marketing claims about its performance and experience. This is quite a courageous marketing strategy given that the SoC powering the phone is currently the worst of its generation when it comes to gaming. Here the competition simply has a major power efficiency advantage, and there is no way around that.

We actively discourage such marketing strategies, as they just try to pull the wool over users’ eyes. While the Honor Play is quite a good phone in itself, a gaming phone it is not. We just hope that in the future we’ll see more responsible and honest marketing, as this summer’s materials were rather incredible, in the worst sense of the word.
