Name: Arm Announces The Mali-G78 GPU: Evolution to 24 Cores
Author: Andrei Frumusanu

Original Link: https://www.anandtech.com/show/15816/arm-announces-the-malig78-evolution-to-24-cores

Arm Announces The Mali-G78 GPU: Evolution to 24 Cores

by Andrei Frumusanu on May 26, 2020 9:00 AM EST

Today as part of Arm’s 2020 TechDay announcements, alongside the release of the brand-new Cortex-A78 and Cortex-X1 CPUs, Arm is also revealing its brand-new Mali-G78 and Mali-G68 GPU IPs.

Last year, Arm had unveiled the new Mali-G77 which was the company’s newest GPU design based on a brand-new compute architecture called Valhall. The design promised major improvements for the company’s GPU IP, shedding some of the disadvantages of past iterations and adapting the architectures to more modern workloads. It was a big change in the design, with implementations seen in chips such as the Samsung Exynos 990 or the MediaTek Dimensity 1000.

The new Mali-G78 in comparison is more of an iterative update to the microarchitecture, making some key improvements in the matter of scalability of the configuration as well as balance of the design for workloads, up to some more radical changes such as a complete redesign of its FMA units.

On the scalability side, the new Mali-G78 now goes up to 24 cores in an implementation, which is a 50% increase in core count compared to the maximum MP16 configuration of the Mali-G77. To date, the biggest configuration we’ve seen in the wild of the G77 was the MP11 setup of the Exynos 990, with MediaTek employing an MP9 setup.

In a projected end-device solution comparison between 2020 and 2021 devices, Arm is projecting the new Mali-G78 to achieve 25% better performance, which includes both microarchitectural as well as process node improvements. That’s generally a reasonable target that vendors are able to achieve on newer-generation IPs, but it’s also going to depend strongly on the exact process node improvements that are projected here, as GPUs generally scale better with improved process density than with just frequency and power improvements of the silicon.

At an ISO-process node under similar implementation area conditions, the Mali-G78 is claimed to improve performance density by 15%. This refers to either performing 15% better at the same area, or shaving off 15% area for the same performance, given that this can be done linearly by just adjusting the number of GPU cores implemented.
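As a quick sanity check on that trade-off, here is a minimal sketch that assumes performance scales linearly with area, as the core-count argument implies. The function names are hypothetical and only the 15% figure comes from Arm. Under a strict reading, a 15% density gain works out to roughly 13% less area at equal performance rather than a full 15%, so the two framings are close but not identical.

```python
def perf_at_same_area(density_gain: float) -> float:
    """Relative performance at unchanged area for a fractional density gain."""
    return 1.0 + density_gain


def area_at_same_perf(density_gain: float) -> float:
    """Relative area needed for unchanged performance (assumes linear scaling)."""
    return 1.0 / (1.0 + density_gain)


gain = 0.15  # Arm's claimed ISO-process performance density improvement
rel_area = area_at_same_perf(gain)
print(f"same area: {perf_at_same_area(gain):.2f}x performance")
print(f"same performance: {rel_area:.3f}x area ({(1 - rel_area) * 100:.1f}% smaller)")
```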

Power efficiency sees a more meagre 10% improvement, which honestly isn’t too fantastic and not that big of a leap over the Mali-G77. ML performance is also said to be improved by 15% thanks to some new microarchitectural tweaks.

Seemingly, the Mali-G78 doesn’t look like too much of an upgrade compared to the vast new redesign we saw last year with the G77, and in a sense, that does seem somewhat reasonable. Still, the G78 makes some interesting changes to its microarchitecture, so let’s delve a bit deeper into what’s changed…

More Scaling, Different Frequency Domains

For people unfamiliar with the Mali-G77 and the Valhall GPU architecture, I highly recommend catching up on last year’s deep dive into the changes of the design, as the majority of those key elements are still very much present on the new Mali-G78.

Read: Arm's New Mali-G77 & Valhall GPU Architecture: A Major Leap

From a top-level perspective, the biggest visible change for the new G78 is the promise that it’ll be able to scale up again to 24 GPU cores. For the last few generations of Mali architectures Arm seemingly has been playing catch-up in trying to consolidate their GPU cores into bigger building blocks, with each successive GPU release always trying to improve the per-core performance rather than just adding in more cores.

Last year when Arm released the G77 the company did exactly this, as a G77 core is roughly equal in capability to two G76 cores. Chipsets such as the Exynos 990 and Dimensity 1000 had “reasonable” core numbers of 11 and 9, bringing down the core count compared to past Mali GPUs. There’s still a stark contrast to other mobile GPU microarchitectures, such as Qualcomm’s current 2-core Adreno or Apple’s 4-core designs. The problem with scaling up performance with smaller cores is that this is never as power efficient as scaling up fewer bigger cores, as the latter have less duplication of functions, meaning fewer overhead transistors to burn power.

In a sense, the Mali-G78 here scaling up to 24 cores again seems like a step backwards. I had feared that the company had still gone with too small a core on the G77/Valhall architecture, as now seemingly we’re going to have core-count creep again in order to scale up performance.

Configuration wise, the one thing that Arm did away with is the option of a 4MB L2. While the company says it still retains this capability, no vendor had ever chosen to go with such an implementation, with essentially all Mali GPUs to date using 2MB options.

From an execution core perspective, the Mali-G78 remains identical to last year’s G77. The big change from the G76 and prior designs was the consolidation of multiple execution engines into a single much wider unit, which had also doubled the SIMD and warp width of the execution lanes.

The overall core block diagram also remains the same. Key aspects here are the single execution engine, and a quad-pumped texture unit that supports up to 4 texels per clock of filtering capability and 2 pixels per clock of render output.

The one key change of the Mali-G78 that Arm talked about the most was the move from a single global frequency domain for the whole GPU to a new two-tier hierarchy, with decoupled frequency domains between the top-level shared GPU blocks and the actual shader cores.

In essence, Arm is introducing asynchronous clock domains within the GPU, allowing the shader cores to operate at a different frequency to the rest of the GPU. This actually can go both ways, with either the cores going faster, or actually slower, than the memory subsystem and tiler blocks.

The main rationale behind this change is to address two problems: geometry throughput and memory throughput for different workloads. In essence, Arm’s GPU architecture has one big problem, and that is that for the GPU to push out a higher number of polygons on screen, the architecture has no option other than trying to scale up its operating frequency. The tiler and geometry engine here are still only able to process a single triangle per clock, and that metric is fixed and non-scalable across GPU configurations.
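The imbalance is easy to see in a small model. The sketch below is only an illustration: the per-clock rates (one triangle per clock for the whole GPU, four texels per clock per core) come from the article, while the `GpuConfig` class, the core counts chosen, and every clock value are hypothetical.

```python
from dataclasses import dataclass

# Per-clock rates quoted in the article; the tiler rate is fixed for the whole GPU.
TRIANGLES_PER_CLOCK = 1
TEXELS_PER_CLOCK_PER_CORE = 4


@dataclass
class GpuConfig:
    cores: int
    top_level_mhz: float  # tiler/geometry clock (hypothetical value)
    shader_mhz: float     # shader core clock (hypothetical value)

    def triangles_per_sec(self) -> float:
        # Geometry rate depends only on the top-level clock, never on core count.
        return TRIANGLES_PER_CLOCK * self.top_level_mhz * 1e6

    def texels_per_sec(self) -> float:
        # Texturing scales with both core count and shader clock.
        return self.cores * TEXELS_PER_CLOCK_PER_CORE * self.shader_mhz * 1e6


mp12 = GpuConfig(cores=12, top_level_mhz=700, shader_mhz=700)
mp24 = GpuConfig(cores=24, top_level_mhz=700, shader_mhz=700)
decoupled = GpuConfig(cores=24, top_level_mhz=850, shader_mhz=700)

for name, cfg in [("MP12", mp12), ("MP24", mp24), ("MP24 decoupled", decoupled)]:
    print(f"{name}: {cfg.triangles_per_sec() / 1e6:.0f} Mtri/s, "
          f"{cfg.texels_per_sec() / 1e9:.1f} Gtexel/s")
```

Doubling the core count doubles the texel rate but leaves the triangle rate untouched; in this model only a higher top-level clock moves the geometry figure.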

In recent years, we’ve seen a change in the mobile GPU landscape, particularly with desktop originating titles such as Fortnite and PUBG making it to our smartphones. One aspect of these newer games is that they’re much more geometry heavy than your usual past mobile titles, and seemingly this has become a problem for the Mali architecture.

Arm’s introduction of different frequency domains is a relatively smart solution to the problem. If you can decouple the frequency between your tiler and geometry engine and the actual GPU cores, you can actually solve the issue of there being an imbalance between geometry throughput that’s not scalable in width, and the core-scalable throughput of compute, texturing and pixel engines.

Furthermore, this decoupling also allows the GPU to operate at different voltages between the two domains. The slower domain would be able to operate at a lower frequency and voltage, thus gaining power efficiency, all whilst in theory not impacting performance. The problem with this is that it now forces the SoC vendor to implement an additional voltage domain and power rail, which can add to the costs of the system.
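To see why the separate rail matters, recall the usual first-order approximation that dynamic power scales with C·V²·f. The sketch below applies that relation to two domains using entirely hypothetical, normalized numbers (the capacitance split, voltages and frequencies are made up for illustration and are not Arm figures). With a shared rail, both domains must run at the voltage the faster one requires; with a split rail, the slower domain can drop its voltage.

```python
def dynamic_power(capacitance: float, voltage: float, freq: float) -> float:
    """First-order CMOS dynamic power model: P is proportional to C * V^2 * f."""
    return capacitance * voltage ** 2 * freq


# Hypothetical share of switched capacitance: shader cores vs. top-level blocks.
SHADER_CAP, TOP_CAP = 0.9, 0.1

# Hypothetical operating points (normalized): the top level runs fast for
# geometry throughput, while the shader cores run at 80% of that clock.
TOP_FREQ, TOP_VOLT = 1.0, 1.0
SHADER_FREQ, SHADER_MIN_VOLT = 0.8, 0.9

# Shared rail: the shader cores are stuck at the top-level voltage.
shared_rail = (dynamic_power(SHADER_CAP, TOP_VOLT, SHADER_FREQ)
               + dynamic_power(TOP_CAP, TOP_VOLT, TOP_FREQ))

# Split rail: the shader cores drop to the lowest voltage their clock allows.
split_rail = (dynamic_power(SHADER_CAP, SHADER_MIN_VOLT, SHADER_FREQ)
              + dynamic_power(TOP_CAP, TOP_VOLT, TOP_FREQ))

saving = 1 - split_rail / shared_rail
print(f"shared rail: {shared_rail:.4f}, split rail: {split_rail:.4f}, "
      f"saving: {saving * 100:.1f}%")
```

The size of the saving depends entirely on the assumed operating points; the quadratic voltage term is why the extra rail is what unlocks most of the efficiency benefit.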

While this all sounds good, I can’t help but think of this being a band-aid solution to a more fundamental problem of the Valhall GPU architecture. The fact that the architecture is only able to support one tiler and geometry engine is the core limitation that led to this asynchronous top level being implemented. In the desktop world, we saw the difficult switch to multi-geometry engine architectures almost a decade ago, and it seems that the need for such a redesign is also creeping up on the mobile space.

Another significant change the G78 brings is the complete rewrite of its FMA engines. This is said to be a joint effort with the Arm CPU group, and has resulted in a 30% energy reduction. Key aspects here were the physical separation of the FP32 and FP16 paths, which does cost more transistors and area to implement, but means fewer transistors actually switching during active operation.

In the G77, Arm says that the FMA units alone accounted for 19% of the dynamic switching energy of the whole GPU. A 30% reduction of that slice means an overall 5-6% improvement of the energy efficiency of the whole GPU, just by that one change.
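The arithmetic behind that 5-6% range can be checked in a few lines. This is a simple sketch using only the two figures quoted above; the variable names are mine, and it assumes the rest of the GPU's dynamic energy is unchanged.

```python
fma_share = 0.19      # FMA units' share of G77 dynamic switching energy (per Arm)
fma_reduction = 0.30  # energy reduction of the redesigned FMA units (per Arm)

# Fraction of whole-GPU dynamic energy removed by the FMA change alone.
energy_saved = fma_share * fma_reduction

# Same work for less energy: efficiency (work per joule) rises by 1/(1 - saved) - 1.
efficiency_gain = 1 / (1 - energy_saved) - 1

print(f"energy saved: {energy_saved * 100:.1f}%")
print(f"efficiency gain: {efficiency_gain * 100:.1f}%")
```

The energy saving lands near the bottom of the range and the equivalent efficiency gain near the top, which is consistent with quoting it as 5-6%.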

Finally, a further efficiency change is an improved tiler that scales better with the increased core counts. The cores’ caches have also had their cache maintenance algorithms improved with better dependency tracking, allowing the cores to handle cache data more smartly and to avoid unnecessary movement of data, which results in a reduction in internal GPU bandwidth as well as power (or more performance thanks to more available bandwidth).

Small Performance Improvements - Uncertain Projections

Summing up all the different microarchitectural advancements, Arm presents us with the different performance improvements we can expect of the Mali-G78:

As for the performance improvements the asynchronous top level can achieve by better balancing geometry against shader core throughput, Arm projects a roughly 8% boost in benchmarks, with a larger ~14% boost in some game titles.

These improvements are quite small, but from a SoC vendor perspective I suppose it wouldn’t be too complicated to implement this, as it would only cost an additional PLL or just a frequency divider in order to achieve the extra performance.

The generational power efficiency improvements of the G78 over the G77 in a similar configuration are 10%, likely attributed to the FMA and cache improvements of the core. It’s small, but we take what we can get.

The energy efficiency benefit of the async feature is claimed to be around 6-13% depending on the workload. This is actually a more complex figure in my view. The main problem is that to achieve this, the SoC vendor needs to actually go ahead and employ a second voltage rail for the GPU to gain the most benefit from the asynchronous frequencies. The efficiency benefit here is small enough that it raises the question of whether it’s not just cheaper to add a few extra cores and clock them lower, rather than incurring the cost of the extra PMIC rail, inductors and capacitors. It’s an easy efficiency gain for flagship SoCs, but I’m really wondering what vendors will be deploying in the mid-range and lower.

Mali-G68 GPU: It's the same

Alongside the Mali-G78, Arm is today also announcing the new Mali-G68 GPU:

You might be wondering why I’m including this as a footnote at the end of the article rather than covering it in more detail. The truth is, this is the exact same IP as the Mali-G78, with the only difference being that this GPU configuration only scales up to 6 cores. In essence, if the microarchitecture is implemented with up to 6 cores, it’s branded as a G68, and if it uses 7 or more cores, it’s branded as a G78.
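Expressed as code, the branding rule is a single threshold on core count. The helper below is a hypothetical illustration of the rule as described here, not anything Arm ships; the lower bound of 1 core is my assumption, while the upper bound of 24 is the G78 maximum mentioned earlier.

```python
def mali_2020_branding(core_count: int) -> str:
    """Map a core count to the marketing name for this shared microarchitecture."""
    # Assumed valid range: 1 (assumption) to 24 (the stated G78 maximum).
    if not 1 <= core_count <= 24:
        raise ValueError(f"unsupported core count: {core_count}")
    return "Mali-G68" if core_count <= 6 else "Mali-G78"


for cores in (4, 6, 7, 24):
    print(cores, "->", mali_2020_branding(cores))
```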

Arm had actually used this marketing with the G57, which ended up being the same IP as the G77, leading to some confusion with the MediaTek Dimensity 800 SoC that was announced earlier this year. We had called that GPU a derivative of the G77 until MediaTek reached out to us to point out that it’s actually the same GPU.

It’s pretty disappointing to see Arm do such marketing exercises, as it can be technically misleading. We asked what their rationale is, and they explained that it’s actually a customer demand for them to better differentiate their products. It’s a somewhat credible argument, but on the other hand we’ve had MediaTek outright want to point out to us this misleading branding, so it seems that not everybody is on the same page on the matter.

Arm does say that they possibly envision that future iterations in this series might actually see real microarchitectural differentiations compared to the bigger implementations. In that scenario, the branding at least would make more sense.

Mali-G78: Meagre improvements, or just bad vendor implementations?

If you didn’t already catch on until now, I’m feeling quite pessimistic about the Mali-G78. First of all, it’s just not that big of a generational upgrade compared to the Mali-G77, even by Arm’s own standards and advertised figures.

You could forgive the smaller upgrades if we had started from an excellent baseline performance. The Mali-G77 promised a whole ton of improvements in both performance and efficiency. The actual results we’ve seen out of the Exynos 990 and the MediaTek D1000 were anything but stellar. On one hand we had a SoC which seemingly had a bad implementation on a seemingly immature process node, and on the other hand we had some very mid-range performance even though it was an MP9 GPU configuration. Truth is, we still don’t know if the Mali-G77 is a good GPU or not, as we simply haven’t seen a good implementation out there. If we don’t know if the G77 is good or not, then it’s also impossible to project if the G78 will be any good.

I see Arm having the exact same problem they’ve been facing in the CPU space until the just announced Cortex-X1, as in they’re stuck with having to design a scalable GPU that fits all target markets and having to please all customer design points. Technically, that’s never the best option, as you end up with something that always has compromises.

As for potential implementers of the G78, amongst the biggest vendors HiSilicon is likely to be the first adopter, if they can manage to bring the new Kirin chipsets to market amidst the current political situation. Whether Samsung and AMD will manage to bring out an RDNA-based mobile Exynos next year is also still unclear, though I’m sure that’s what they’re striving for. The biggest issue on the competitive landscape is Apple. Even if the G77 had managed to live up to its projections, the G78 is certainly showcasing too meagre improvements to be able to catch up to the Apple GPUs. We’re also supposed to be seeing the first Imagination A-series GPU SoC designs later this year, which is a whole other wildcard. That’s a very tough competitive landscape for Mali, so let’s hope the G78 will see more success in the future.