Name: AMD Beema/Mullins Architecture & Performance Preview
Author: Anand Lal Shimpi

Original Link: https://www.anandtech.com/show/7974/amd-beema-mullins-architecture-a10-micro-6700t-performance-preview


by Anand Lal Shimpi on April 29, 2014 12:00 AM EST


When AMD launched its Kabini and Temash APUs last year it delivered a compelling cost/performance story, but its power story wasn’t all that impressive. Despite being built out of relatively low power components, nearly all of AMD’s entry level APUs carried 15W TDPs, with a couple weighing in at 8 - 9W and only a single 1GHz dual-core part dropping down to 3.9W. By comparison, Intel was shipping full blown Haswell Ultrabook parts at 15W - offering substantially better CPU performance, in a similar thermal envelope (although at a higher cost). The real disruption for AMD was Intel’s Bay Trail, which showed up with a similar looking micro architecture running at substantially higher clock speeds and TDPs below 8W.

AMD seemed to have all of the right pieces to build a power efficient mobile SoC, but for some reason we weren’t seeing it. Today that begins to change with the successors to Kabini and Temash.

Codenamed Beema and Mullins, these are the 2014 updates to Kabini and Temash (respectively). Beema is aimed at entry level notebooks, while Mullins targets tablets. For now, both are designed for Windows machines. Although I suspect we’ll eventually see AMD address the native Android market head on, for now AMD is relying on running Android on top of Windows for those who really want it. No word on if/when we’ll get a socketed Beema for entry level desktops.

Like their predecessors, Beema and Mullins combine four low power AMD x86 cores (Puma+ this time, instead of Jaguar) with 128 GCN based Radeon GPU cores. AMD will continue to offer a couple of dual-core SKUs, but they are harvested from a quad-core die. AMD remains unwilling to release official die area figures, but there is a slight increase in transistor count:

AMD/Intel Transistor Count & Die Area Comparison
SoC	Process Node	Transistor Count	Die Area
AMD Zacate	TSMC 40nm	450M+	75mm²
AMD Kabini/Temash	TSMC 28nm	914M	~107mm² (est)
AMD Beema/Mullins	GF 28nm	930M	~107mm² (est)
AMD Llano	GF 32nm SOI	1.18B	228mm²
AMD Trinity/Richland	GF 32nm SOI	1.30B	246mm²
AMD Kaveri	GF 28nm SHP	2.41B	245mm²
Intel Haswell (4C/GT2)	Intel 22nm	1.40B	177mm²

I’d expect a similar die size to Kabini/Temash. It’s interesting to note that these SoCs have a transistor count somewhere south of Apple’s A7.

Puma+ is based on the same microarchitecture as Jaguar. We’re still looking at a 2-wide OoO design with the same number of execution units and data structures inside the chip. The memory interface remains unchanged as well at 64 bits wide. These new SoCs are still built on the same 28nm process as their predecessors. The process, however, has seen some improvements. Not only are both the CPU and GPU designs slightly better optimized for lower power operation, but both benefit from improvements to the manufacturing process resulting in substantial decreases in leakage current.

AMD claims a 19% reduction in core leakage/static current for Puma+ compared to Jaguar at 1.2V, and a 38% reduction for the GPU. The drop in leakage directly contributes to a substantially lower power profile for Beema and Mullins.

AMD also went in and tweaked the SoC’s memory interface. Kabini/Temash had a standard PC-like DDR3 memory interface. All of the complexity required for broad memory module compatibility and variations in trace routing was handled by the controller itself. This not only added complexity to the DDR3 interface but power as well. With Beema and Mullins, AMD took a page from the smartphone SoC design guide and traded flexibility for power. These platforms now ship with stricter guidelines as to what sort of memory can be used on board and how traces must be routed. The result is a memory interface that shaves off more than 500mW when in this stricter, low power mode. OEMs looking to ship a design with socketed DRAM can still run the memory interface in a higher power mode to ensure memory compatibility.

These SoCs won’t be available in a PoP configuration unfortunately - OEMs will have to rely on discrete DRAM packages rather than a fully integrated solution. Beema/Mullins also show up to a 200mW reduction in power consumed by the display interface compared to Kabini/Temash.

The combination of all of this is 20% lower idle power compared to the previous generation of AMD entry level and low power APUs. AMD put together a nice graph illustrating its progress over the years:

Beema and Mullins are definitely in a good place, however they still do consume more power at idle than the smartphone SoCs we typically find in iOS and Android tablets. AMD isolated APU power for the graph above and is using an “eReader” workload (aka display on but not animating, system otherwise idle). It just so happens I gathered similar data for our 2013 Nexus 7 review. The workloads and measurements are different (AMD isolates APU power, I’m looking at total platform power minus display) but it’s enough to put things in perspective:

SoC Idle Power Comparison

AMD has dropped power consumption considerably over the years, but it’s still not as power efficient as high end mobile silicon.

AMD sees no value in supporting Microsoft's Connected Standby standard at this point, which makes sense given the limited success of Windows 8 tablets. Once again this seems to point to AMD eventually adopting Android for its tablet aspirations.

Looking forward, AMD has more tricks up its sleeve to continue to drive power down. Most interesting on the list? We’ll see an integrated voltage regulator (ala Haswell’s FIVR) from AMD in 2015.

New Turbo Boost

With power in perspective, let’s talk about performance and the lineup. It always made little sense that despite a very competitive microarchitecture, Jaguar both consumed more power and performed worse than Intel’s Silvermont. It turns out that’s more a function of the limited time AMD’s Jaguar team had to bring the design to market. As the basis not only for AMD’s own entry level APUs but also the semi-custom SoC bids for consoles from Microsoft and Sony, Jaguar had to be done quickly. With Puma+ and its associated SoC designs, AMD could focus more on driving power down and introducing new features, one of which happens to be a very intelligent clock boosting scheme analogous to Intel’s Turbo Boost.

While the bulk of Kabini and Temash silicon ran up to a set maximum frequency, Beema and Mullins SoCs can take advantage of available thermal headroom to increase their maximum frequency for a limited period of time. If we look at the tables below we’ll see this in action:

Mullins vs. Temash - Frequency Gains
Model	TDP	Max CPU Frequency	Temash Equivalent	Temash Equivalent (TDP)	Temash Max CPU Frequency	Max Frequency Increase from Mullins
A10 Micro-6700T	4.5W	2.2GHz	A6-1450	8W	1.4GHz	57%
A4 Micro-6400T	4.5W	1.6GHz	A4-1250	9W	1.0GHz	60%
E1 Micro-6200T	3.95W	1.4GHz	A4-1200	3.9W	1.0GHz	40%

AMD no longer reports max non-turbo frequency, unfortunately following in Intel’s footsteps (as well as the rest of the mobile players), but you can assume that they are mostly unchanged from Kabini/Temash. Beema and Mullins can now turbo up to much higher frequencies. In the case of Mullins in particular, since it’s so thermally constrained, the potential upside for frequency scaling is huge.

Beema vs. Kabini - Frequency Gains
Model	TDP	Max CPU Frequency	Kabini Equivalent	Kabini Equivalent (TDP)	Kabini Max CPU Frequency	Max Frequency Increase from Beema
A6-6310	15W	2.4GHz	A6-5200	25W	2.0GHz	20%
A4-6210	15W	1.8GHz	A4-5000	15W	1.5GHz	20%
E2-6110	15W	1.5GHz	E2-3000/E1-2500	15W	1.65GHz/1.4GHz	-9%/7%
E1-6010	10W	1.35GHz	E1-2100	9W	1.0GHz	35%
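The percentage columns in the two CPU tables are plain ratios of the new part's maximum clock to its predecessor's. A minimal sketch of that arithmetic follows, using the clock speeds listed in the tables; the function name and the rounding to the nearest whole percent are my own choices, not anything AMD specifies.

```python
def pct_increase(new_ghz: float, old_ghz: float) -> int:
    """Percentage change from old_ghz to new_ghz, rounded to the nearest percent."""
    return round((new_ghz / old_ghz - 1.0) * 100)


# (new part, max GHz, predecessor, predecessor max GHz), taken from the tables
PAIRS = [
    ("A10 Micro-6700T", 2.2, "A6-1450", 1.4),
    ("A4 Micro-6400T", 1.6, "A4-1250", 1.0),
    ("E1 Micro-6200T", 1.4, "A4-1200", 1.0),
    ("A6-6310", 2.4, "A6-5200", 2.0),
    ("A4-6210", 1.8, "A4-5000", 1.5),
    ("E2-6110", 1.5, "E2-3000", 1.65),
    ("E2-6110", 1.5, "E1-2500", 1.4),
    ("E1-6010", 1.35, "E1-2100", 1.0),
]

if __name__ == "__main__":
    for new_part, new_ghz, old_part, old_ghz in PAIRS:
        change = pct_increase(new_ghz, old_ghz)
        print(f"{new_part} vs. {old_part}: {change:+d}%")
```

Keep in mind these are peak boost clocks compared against the older parts' fixed maximums, so the ratio says nothing about how long the new frequency can be sustained.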

The frequency gains aren't limited to the CPU; the 128 GCN cores can also run at higher speeds with Beema and Mullins:

Mullins vs. Temash - GPU Frequency Gains
Model	TDP	Max GPU Frequency	Temash Equivalent	Temash Equivalent (TDP)	Temash Max GPU Frequency	Max GPU Frequency Increase from Mullins
A10 Micro-6700T	4.5W	500MHz	A6-1450	8W	400MHz	25%
A4 Micro-6400T	4.5W	350MHz	A4-1250	9W	300MHz	17%
E1 Micro-6200T	3.95W	300MHz	A4-1200	3.9W	225MHz	33%

Beema vs. Kabini - GPU Frequency Gains
Model	TDP	Max GPU Frequency	Kabini Equivalent	Kabini Equivalent (TDP)	Kabini Max GPU Frequency	Max GPU Frequency Increase from Beema
A6-6310	15W	800MHz	A6-5200	25W	600MHz	33%
A4-6210	15W	600MHz	A4-5000	15W	500MHz	20%
E2-6110	15W	500MHz	E2-3000/E1-2500	15W	450/400MHz	11%/25%
E1-6010	10W	350MHz	E1-2100	9W	300MHz	17%

How can AMD hit significantly higher frequencies without a substantial architecture change or new process node? By raising the max thermal operating point of the silicon. Similar to what Intel discovered in architecting its Bay Trail silicon, AMD realized that in ultra portable form factors it would run into a chassis temperature limit before it ever reached the maximum operating temperature of its silicon.

Previously, once the silicon temperature hit 60C, AMD would cap max CPU/GPU frequency. What really matters, however, isn’t whether the silicon is running warm but whether the chassis is running too warm. With Beema and Mullins, AMD raises the silicon temperature limit to around 100C (still within physical limits) and relies on the surface temperature of the device to determine when to throttle back the CPU/GPU. In AMD’s own words, this allows the SoC to run at a much higher frequency for up to several minutes before having to scale back down. As long as the physical limits of the die aren’t exceeded, the design remains just as safe as before, but you get better performance.
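To make the difference between the two policies concrete, here is a rough sketch that contrasts a die-temperature cap with a skin-temperature governed one. It is only an illustration of the idea described above, not AMD's firmware logic: the 40C skin limit and the 1000MHz base clock are hypothetical values, while the 60C and roughly 100C die limits and the 2200MHz boost clock come from the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThermalLimits:
    old_die_cap_c: float = 60.0   # die temperature at which the old policy capped clocks
    die_limit_c: float = 100.0    # approximate new silicon limit quoted above
    skin_limit_c: float = 40.0    # hypothetical chassis comfort limit (assumption)


def old_policy_mhz(die_temp_c: float, base_mhz: int, boost_mhz: int,
                   limits: ThermalLimits = ThermalLimits()) -> int:
    """Old behavior: fall back to the base clock once the die reaches 60C."""
    return base_mhz if die_temp_c >= limits.old_die_cap_c else boost_mhz


def stapm_like_mhz(die_temp_c: float, est_skin_temp_c: float,
                   base_mhz: int, boost_mhz: int,
                   limits: ThermalLimits = ThermalLimits()) -> int:
    """Skin-governed behavior: boost while the die is under its hard limit
    and the estimated chassis surface temperature is under the skin limit."""
    if die_temp_c >= limits.die_limit_c:
        return base_mhz  # protect the silicon first
    if est_skin_temp_c >= limits.skin_limit_c:
        return base_mhz  # chassis is getting too warm to hold
    return boost_mhz


if __name__ == "__main__":
    # A warm die (75C) inside a still-cool chassis (35C estimated skin temperature).
    print(old_policy_mhz(75.0, 1000, 2200))
    print(stapm_like_mhz(75.0, 35.0, 1000, 2200))
```

In this toy scenario the old policy has already given up its headroom while the skin-governed one keeps boosting, which is the whole point of the change. A real governor would presumably step through intermediate frequencies rather than toggling between two clocks.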

The real trick is that AMD is able to enable this new chassis temperature governed boost (called Skin Temperature Aware Power Management - STAPM) without requiring any additional sensors or hardware from the OEM. Instead, AMD gives the OEM tools to properly map SoC temperature to chassis skin temperature. My guess is the OEM runs a set workload, measuring external chassis temperature while correlating that data with SoC temperature. This mapping will vary on a device by device basis, and obviously won’t be as accurate as having a thermal sensor on the chassis itself, but it’s good enough to get the job done.
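If the calibration works the way I'm guessing, the simplest possible version would be a straight-line fit between logged die temperatures and measured chassis temperatures. The sketch below shows that idea with an ordinary least-squares fit; the sample readings are invented for illustration, the function names are mine, and AMD's actual tools almost certainly account for things a static line ignores (thermal lag in particular).

```python
def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_skin_temp(die_temp_c: float, slope: float, intercept: float) -> float:
    """Estimate chassis surface temperature from a die temperature reading."""
    return slope * die_temp_c + intercept


# Invented calibration log: die temperature vs. measured skin temperature (C).
DIE_TEMPS_C = [40.0, 60.0, 80.0, 100.0]
SKIN_TEMPS_C = [30.0, 35.0, 40.0, 45.0]

if __name__ == "__main__":
    slope, intercept = fit_line(DIE_TEMPS_C, SKIN_TEMPS_C)
    print(f"skin = {slope:.2f} * die + {intercept:.1f}")
    print(f"estimated skin temp at 70C die: {estimate_skin_temp(70.0, slope, intercept):.1f}C")
```

Because the fit depends on the chassis materials and cooling design, each device would need its own coefficients, which matches the point that the mapping varies device by device.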

AMD claims it’s intelligent about when to boost. The updated power management unit looks at the response to frequency scaling of a given workload and will only boost when the workload will actually benefit from being boosted. This evaluation happens at the hardware instruction level and not at the OS/software layer.
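AMD didn't detail the metric it uses, but the general idea of only boosting workloads that respond to frequency can be sketched as a simple sensitivity ratio. Everything here is an assumption for illustration: the function names, the 0.5 threshold, and the sample performance numbers are hypothetical, and the real evaluation happens in hardware rather than in software like this.

```python
def frequency_sensitivity(perf_low: float, perf_high: float,
                          freq_low_mhz: float, freq_high_mhz: float) -> float:
    """Fraction of a frequency increase that shows up as a performance increase.

    Roughly 1.0 means performance scales with clock speed (compute bound);
    near 0.0 means the workload is waiting on something else, such as memory.
    """
    perf_gain = perf_high / perf_low - 1.0
    freq_gain = freq_high_mhz / freq_low_mhz - 1.0
    return perf_gain / freq_gain


def should_boost(sensitivity: float, threshold: float = 0.5) -> bool:
    """Boost only when the workload benefits enough (threshold is hypothetical)."""
    return sensitivity >= threshold


if __name__ == "__main__":
    # Invented samples: throughput measured at 1000MHz and again at 1200MHz.
    compute_bound = frequency_sensitivity(100.0, 118.0, 1000.0, 1200.0)
    memory_bound = frequency_sensitivity(100.0, 102.0, 1000.0, 1200.0)
    print(f"compute bound: {compute_bound:.2f}, boost={should_boost(compute_bound)}")
    print(f"memory bound:  {memory_bound:.2f}, boost={should_boost(memory_bound)}")
```

The payoff of a check like this is that thermal headroom isn't spent on workloads that would stall at the higher clock anyway.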

The Lineup

Apart from compressing the Kabini family from five parts into four, AMD kept the same number of SKUs as last year, with updated specs for Beema and Mullins:

AMD Mullins vs. Temash APUs
Model	Radeon Brand	SDP	TDP	CPU Cores	CPU Clock Speed (Max)	L2 Cache	Radeon Cores	GPU Clock Speed (Max)	DDR3 Speed (Max)
A10 Micro-6700T	R6	2.8W	4.5W	4	2.2GHz	2MB	128	500MHz	1333
A4 Micro-6400T	R3	2.8W	4.5W	4	1.6GHz	2MB	128	350MHz	1333
E1 Micro-6200T	R2	2.8W	3.95W	2	1.4GHz	1MB	128	300MHz	1066
A6-1450	HD 8250		8W	4	1.4GHz	2MB	128	400MHz	1066
A4-1250	HD 8210		9W	2	1.0GHz	1MB	128	300MHz	1333
A4-1200	HD 8180		3.9W	2	1.0GHz	1MB	128	225MHz	1066

The Mullins parts get a Micro prefix in front of their model number, implying the SoC's tablet-friendliness. AMD also supplies both TDP and Scenario Design Power (SDP) values for Mullins SoCs, similar to what Intel does with Bay Trail. The latter uses more tablet-like workloads (read: lighter weight) while determining SoC power.

With the exception of the entry level E1 Micro-6200T, TDPs go down substantially with Mullins vs. Temash. Cache sizes and GPU core count remain unchanged, but CPU frequencies and max supported DRAM frequency go up in many cases.

AMD Beema vs. Kabini APUs
Model	Radeon Brand	SDP	TDP	CPU Cores	CPU Clock Speed (Max)	L2 Cache	Radeon Cores	GPU Clock Speed (Max)	DDR3 Speed (Max)
A6-6310	R4		15W	4	2.4GHz	2MB	128	800MHz	1866
A4-6210	R3		15W	4	1.8GHz	2MB	128	600MHz	1600
E2-6110	R2		15W	4	1.5GHz	2MB	128	500MHz	1600
E1-6010	R2		10W	2	1.35GHz	1MB	128	350MHz	1333
A6-5200	HD 8400		25W	4	2.0GHz	2MB	128	600MHz	1600
A4-5000	HD 8330		15W	4	1.5GHz	2MB	128	500MHz	1600
E2-3000	HD 8280		15W	2	1.65GHz	1MB	128	450MHz	1600
E1-2500	HD 8240		15W	2	1.4GHz	1MB	128	400MHz	1333
E1-2100	HD 8210		9W	2	1.0GHz	1MB	128	300MHz	1333

Beema sees the end of the lone 25W TDP for Kabini; everything is now at 15W or less. The lowest end Beema carries a slightly higher TDP than the entry level Kabini, but otherwise there's more performance at the same TDP across the board. Beema parts don't come with an SDP rating as they're designed for use in more traditional ultrathin notebook PC form factors (presumably running more traditional, read: heavier, workloads).

TrustZone

In 2012 AMD announced that it had signed a license agreement with ARM. Although we’ve since seen AMD announce ARM based Opteron silicon, back then the only official commitment was to ship an x86 SoC in 2013 with an integrated ARM Cortex A5 for TrustZone execution. AMD needed a hardware security platform on its SoCs to remain competitive, and it didn’t have one of its own (Intel’s TXT is proprietary and not a part of what’s licensed to AMD) so ARM’s TrustZone technology was an easy target. To support TrustZone you need an ARM core, and thus AMD committed to integrating a Cortex A5 as a dedicated security processor on some of its 2013 APUs.

Indeed, both Kabini and Temash had a Cortex A5 on die; it was simply never enabled due to time constraints. With Beema and Mullins the core is fully functional in what AMD is calling its Platform Security Processor (PSP). AMD will likely publish guidelines on how developers can access and use the PSP, and I’d also expect to see it make its way into other AMD APUs moving forward.

The Discovery Tablet

AMD is in a difficult position these days. Traditionally it was the cheaper alternative to Intel, but with Bay Trail Intel made a serious push into segments where OEMs would traditionally use lower cost AMD silicon. In an attempt to be more than a lower cost Intel alternative, AMD is throwing its hat into the form factor reference design race and offering OEMs an example of a full, ideal, high performance implementation of its silicon. One such example is AMD’s Discovery Tablet, an 11.6-inch 1080p Windows 8.1 tablet design that features AMD’s highest end A10 Micro-6700T Mullins silicon. The tablet is a bit larger and heavier than I’d like. If AMD is going to build a reference platform I’d prefer it to be a form factor I’d actually use, which in a tablet is going to be something smaller than 11.6-inches. If you are trying to cover both tablet and 2-in-1 form factors however, the Discovery Tablet makes sense.

I was allowed to spend a few hours benchmarking AMD’s Discovery Tablet. Unfortunately the device wasn’t instrumented for power testing, nor was there enough time to run any battery life tests on it, so the usefulness of these numbers is limited. We already know that AMD’s idle power isn’t as good as smartphone silicon, but for some of these value Windows 8.1 devices it may still be good enough.


Tablet JS/Web Browser Tests

We'll start with our usual set of JavaScript tests. Here we see AMD's A10 Micro-6700T outperform everything on the list. Whether we're talking about Bay Trail or Apple's A7, the 6700T pulls ahead by a decent margin. Once again the big question is how much power is being drawn to deliver this performance. Unlike Intel's Bay Trail preview, AMD didn't have any instrumented Discovery tablets set up for us to monitor power consumption. I suspect AMD's power consumption is competitive, but my guess is it isn't similarly class leading.

SunSpider 1.0 Benchmark

Mozilla Kraken Benchmark (Stock Browser)

WebXPRT (Chrome/Mobile Safari)

CPU Performance

For these next tests I turned to some of our more traditional Windows PC benchmarks. I looked at Cinebench 11.5 to get an idea for how single and multithreaded performance have changed since last year.

Looking at single threaded performance we immediately see the benefits of AMD's new boosting capabilities. The Puma+ cores are 35% faster than Intel's Silvermont cores, and can deliver nearly 80% of the performance of AMD's Piledriver cores found in Trinity. I threw some Llano results in here as well - Mullins offers around 85% of the performance of Llano.

Cinebench R11.5 - Single-Threaded Benchmark

Multithreaded performance is pretty evenly matched between Bay Trail and Mullins here. Note that Mullins manages to deliver very similar performance to Kabini, despite coming in at substantially less power. The comparison to Brazos (E-350) is laughable: Mullins is substantially faster.

Cinebench R11.5 - Multi-Threaded Benchmark

PCMark 7 gives us a better look at the overall performance of the Discovery tablet hardware. Here we do see it lose ground to the Kabini notebook (A4-5000) as well as the Bay Trail devices. It's unclear to me if we are seeing the thermal limits of the hardware (this is a longer test) or if there are other elements at work here (e.g. storage performance limits).

PCMark 7 (2013)

GPU Performance

We don’t actually have any Bay Trail devices in our Laptop 2013 bench, they are all in our Tablet 2013 category, which unfortunately uses different benchmarks. To make a long story short, we have Bay Trail vs. Kabini data, and Kabini vs. Mullins data. Thankfully the comparison between Bay Trail and Mullins is pretty easy to make.

AMD’s 4.5W TDP A10 Micro-6700T delivers roughly the same GPU performance as a 15W A4-5000. The A4-5000 also ends up being anywhere from 50% to over 2x the speed of Bay Trail when it comes to GPU performance, so you can expect Mullins to hold roughly the same advantage.

Compared to the old 35W Trinity, Mullins still has a ways to go. Trinity delivers roughly 2x the performance of Mullins, although at nearly 8x the TDP.
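Taking the quoted TDPs at face value (35W for Trinity, 4.5W for the A10 Micro-6700T) along with the rough 2x performance gap, the performance per TDP watt comparison is simple arithmetic. The sketch below works it through; note that TDP is a design limit rather than measured power draw, so treat the result as a ballpark figure only.

```python
def perf_per_tdp_watt(relative_perf: float, tdp_w: float) -> float:
    """Relative GPU performance divided by rated TDP in watts."""
    return relative_perf / tdp_w


MULLINS_TDP_W = 4.5    # A10 Micro-6700T
TRINITY_TDP_W = 35.0   # the older Trinity part used for comparison

# Mullins is the baseline (1.0); Trinity is roughly 2x faster per the text.
mullins_eff = perf_per_tdp_watt(1.0, MULLINS_TDP_W)
trinity_eff = perf_per_tdp_watt(2.0, TRINITY_TDP_W)

tdp_ratio = TRINITY_TDP_W / MULLINS_TDP_W
efficiency_advantage = mullins_eff / trinity_eff

if __name__ == "__main__":
    print(f"Trinity TDP is {tdp_ratio:.1f}x that of Mullins")
    print(f"Mullins delivers {efficiency_advantage:.1f}x the performance per TDP watt")
```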

Futuremark 3DMark (2013)

Final Words

Despite no significant changes to the architecture or manufacturing process, AMD’s 2014 updates to its entry level and low power silicon are substantial. We finally have AMD silicon, built around a non-Bulldozer architecture, that seems to have turbo capabilities comparable to Intel’s. The result is a completely different performance profile. While AMD’s Jaguar cores in Kabini and Temash were easily outperformed by Intel’s Bay Trail, Puma+ pulls ahead. AMD continues to hold a substantial GPU performance advantage as well.

The gains in performance come while decreasing platform power. You can now have roughly the same performance as AMD offered last year in a 15W entry level notebook part, in a 4.5W TDP (2.8W SDP) tablet SKU. That’s seriously impressive.

The progress AMD made in a year with Beema and Mullins shows just how time constrained the team(s) were with bringing Kabini and Temash to market in 2013. While both of those SoCs were quite successful for AMD, I expect that at some point AMD won’t be allowed two years to fully polish a single design.

The big unknown is how these new SoCs stack up against Bay Trail when it comes to power consumption. From a performance standpoint at the very high end they are faster, but we’ll have to wait until we can get our hands on shipping devices before we know the full story when it comes to battery life. AMD expects to see Beema and Mullins designs show up over the next 1 - 2 quarters, with some designs shipping in the coming weeks to specific regions.

The other thing we need to see is a real Android strategy from AMD. Mullins seems like a good fit for a high performance Android tablet, but today AMD’s native OS strategy is exclusively Windows. I don’t think it’ll stay that way for long, but AMD has yet to give us any indication of when it’ll change.

And if I’m asking for things I want to see from AMD, you can add a PoP package and idle power that’s competitive with the likes of Apple and Qualcomm. AMD clearly came a long way over the past couple of years, but there’s still more progress to be made.