HStewart - Monday, August 20, 2018 - link
One thing I'm curious about is why Samsung uses this CPU in its European products but Snapdragon in the US and Japan markets - the only thing I can think of is regulations for security on the chips.
Great_Scott - Monday, August 20, 2018 - link
My understanding is that the US version uses the Snapdragon for its integrated modem.
I'm not exactly sure why this is; it might be an artifact of needing legacy CDMA support for Verizon/Sprint, which isn't needed for a world-phone.
beginner99 - Tuesday, August 21, 2018 - link
That is what I thought as well. CDMA. Not needed for the rest of the world.
abufrejoval - Friday, August 24, 2018 - link
Unless you assume that some people actually travel. You might even argue that the average European is more likely to enter CDMA space in his lifetime than the other way around: most US citizens only ever leave the country to fight a war, I keep hearing.
I prefer to choose for myself rather than have choices mandated to me based on where I tend to be most of the time.
Ej24 - Friday, August 24, 2018 - link
Samsung's own Shannon modem works on CDMA. My Galaxy S6 on Verizon Wireless is proof of that. It's probably legal reasons - patents, licensing, whatever. It's annoying, because Qualcomm has put out some duds and we don't have a choice.
az060693 - Monday, August 20, 2018 - link
It's supposedly due to a 1993 patent licensing deal - https://www.androidcentral.com/qualcomm-licensing-...
Though in certain phone generations, it might have also been due to supply issues.
HStewart - Monday, August 20, 2018 - link
That sounds about right - but I believe the Qualcomm version is faster than the Samsung version. This is a problem with having the modem built into the chip - you have to use a different CPU depending on which modem is in the SoC.
bebby - Tuesday, August 21, 2018 - link
There is another more or less official reason - to reduce risk. Instead of relying on only one chip, they use two, one designed in-house and one external, in case the internal chip has issues. Samsung manufactures both in-house anyhow.
name99 - Monday, August 20, 2018 - link
At least one reason they use Snapdragon is for markets that still require voice CDMA, which QC modems provide.
aryonoco - Monday, August 20, 2018 - link
The S6 showed that Samsung has no problems making integrated CDMA modems.
name99 - Wednesday, August 22, 2018 - link
The issue is not technology (everyone knows CDMA basics because you need it on the data side for WCDMA aka 3G). The issue is licensing.
Maybe QC changed the licensing terms so that doing it in-house was simply not worth the cost? This seems very much in line with the worldwide lawsuits against QC for various anti-competitive behavior.
petar_b - Saturday, November 3, 2018 - link
I think when they use Exynos they have fewer royalty fees to pay since it's their own processor. The EU market is smaller than the US or Asia, so they compensate by providing a chip with lower costs.
PS - CDMA, modems and all the other reasons mentioned are irrelevant; other brands use Qualcomm in the EU and everything works perfectly.
AlB80 - Monday, August 20, 2018 - link
Unused cores do not consume energy at all, regardless of voltage. Thus, to lose efficiency, there would have to be two threads on the performance cores where one requires 3 GHz but the second does not.
linuxgeex - Tuesday, August 21, 2018 - link
While technically you're right that the core itself doesn't draw power when it's powered down, there are other factors to consider:
1) the power cost to power it down and back up when you need it again. If you use the core intermittently, then the cost of powering it up and down can exceed the cost of operating it for the workload, and then it may be more efficient to keep those jobs on the less efficient core(s) (a toy break-even sketch follows after point 2).
2) the uncore, i.e. the buses and support logic that don't get powered down, can use nearly as much power as the cores themselves. In some CPUs, e.g. the 2nd-gen Ryzen Threadripper 2990WX, the uncore actually consumes 76% of the die power when only 2 of its 32 cores are active.
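To make point 1 concrete, here is a toy break-even model in Python; the leakage and gating-energy numbers are made-up assumptions for illustration, not measurements of any real SoC:

def gating_pays_off(idle_time_s,
                    leakage_power_w=0.10,        # assumed leakage of an idle, un-gated core
                    gate_ungate_energy_j=0.02):  # assumed energy to power down and back up
    # Gating saves energy only if the leakage avoided while idle
    # exceeds the one-off cost of the down/up transition.
    leakage_avoided_j = leakage_power_w * idle_time_s
    return leakage_avoided_j > gate_ungate_energy_j

for idle_s in (0.01, 0.1, 1.0):  # gap between bursts of work, in seconds
    print(f"idle {idle_s:>5}s -> worth gating: {gating_pays_off(idle_s)}")

The longer the core would sit idle, the more the avoided leakage outweighs the one-off gating cost; for short gaps it is cheaper to leave the core alone (or park the work on an efficiency core), which is exactly point 1.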
name99 - Monday, August 20, 2018 - link
"When Arm disclosed the A76 µarch details and particularly the 128-entry ROB (which in comparison seems quite small to the M3), they said that this was a balance between performance and area/power. In particular we saw a mention that a 7% increase in the ROB capacity only came with a 1% performance gain on average."ROB per se costs very little, it's just a queue.
What is expensive is the physical register file, and the fact that it more or less scales with the size of the ROB (since most instructions generate one value, which consumes one physical register).
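As a back-of-the-envelope sketch of that scaling (the bit widths are assumptions, and the ROB size is just the roughly 228 entries reported for the M3, so treat the numbers as purely illustrative):

ROB_ENTRIES        = 228   # roughly the ROB capacity reported for the M3
ROB_BITS_PER_ENTRY = 40    # assumed: bookkeeping/status only, no result data
ARCH_INT_REGS      = 32    # AArch64 general-purpose registers
BITS_PER_PHYS_REG  = 64

# Most in-flight instructions produce one value, so the integer PRF needs
# roughly (architectural registers) + (in-flight producers) entries.
prf_entries = ARCH_INT_REGS + ROB_ENTRIES

print("ROB storage ~", ROB_ENTRIES * ROB_BITS_PER_ENTRY, "bits")
print("PRF storage ~", prf_entries * BITS_PER_PHYS_REG, "bits")
# And on top of the raw bits, the PRF is heavily multi-ported, and port count
# is what really drives its area/power - unlike the simple ROB queue.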
The trick, then, is what can you do to increase the size of the ROB (which allows you to do more work during dead periods while your ROB-blocking instruction at the head of the queue is waiting on DRAM) without paying the cost of the register file?
There is a bag of tricks for this, and as you make your CPU more advanced, you use more and more of them. They include:
- clustering: you duplicate the register file and have half the execution units use one copy, the other half use the other copy. This works because there are quadratic aspects to the register file, so cutting some things in half (even if they are then duplicated) reduces area/power by four (see the sketch after this list).
- various "resource amplification" techniques getting more use out of what you have. These might be giving your register file one fewer read ports, but then having a smart allocator that can cope if the reads are oversubscribed. Or it might be delayed register allocation and early register release (so the register is held for a shorter time). Or it might be various forms of instruction fusion.
- you can try to set aside instructions that you believe will be dependent on the blocking instruction, so that they do not even get allocated a register until the blocking instruction completes. There has been some very interesting recent work on how you can do this without requiring long range communication, so while this has been talked about for 20 years, we might soon see actual implementations.
- a variant on the above: you can measure how "critical" instructions are (i.e. does the rest of the computation get delayed if this one instruction gets delayed?). Based on this knowledge, you can send through critical instructions when resources are scarce, and delay non-critical ones until more resources are available.
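A quick sketch of the clustering arithmetic from the first bullet; the entry and port counts are made up purely to show the scaling:

def rf_cost(entries, ports):
    return entries * ports ** 2   # crude model: register file area/power ~ entries * ports^2

entries, ports = 160, 12          # assumed monolithic PRF, illustrative numbers only
monolithic = rf_cost(entries, ports)
per_copy   = rf_cost(entries, ports // 2)  # duplicate the file, halve the ports per copy
clustered  = 2 * per_copy                  # one copy per execution cluster

print(f"monolithic: {monolithic}, per clustered copy: {per_copy}, both copies: {clustered}")
# Each copy comes out ~4x cheaper, so even two copies total ~2x less than one big file.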
The larger point here is that the reason Samsung et al are happy to tell you the info they are telling you (number of ROB slots, numbers of execution units, etc) is because this stuff is thoroughly uninteresting and uncompetitive. The competitive stuff is the sort of thing I have described above --- how does company A get 1.5x the performance from a certain level of HW versus company B --- and that's what no-one is ever willing to talk about...
Best you can do is see what look like good ideas in the academic literature of a few years ago and then assume that at least some of them have been picked up.
jospoortvliet - Wednesday, August 29, 2018 - link
Mja, this stuff might be uninteresting for deep techies, but for many it is nice to get some degree of comparison between the various CPUs being built.
name99 - Monday, August 20, 2018 - link
"The fetch unit’s bandwidth has been doubled and now can read up to 48 bytes per cycle which corresponds to 12 32b instructions per cycle – this results in a 2:1 ratio of fetch versus decode capacity which is an increase over the 1.5:1 ratio (24B/c, 4 decode) in the M1. Samsung explains that the big increase is needed to combat the increasingly big problem of branch bubbles on wider microarchitectures. They admit that on average, the distance between taken branches is less than 12 instructions, but the larger width helps a lot for temporary bursts of instructions."One point you missed is that it's generally not worth the (area+power) cost to allow two lines per cycle to be read from your I-cache. So that 12-wide fetch is a maximum, which can only be reached if the starting address is one of the first four (of sixteen) instructions in the line. Every later instruction gives you a shorter fetch because you only extend to the end of the line.
Averaged across all possibilities, your average fetch width is something like nine, which is still larger than the six sustained you need, but not quite as extravagant as it seems.
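For intuition, here is that averaging worked out under the simplifying assumption that fetch start offsets are uniformly distributed within a 16-instruction line; real start addresses are biased toward the beginning of a line (branch targets, and fetch resuming at a fresh line), which pulls the average up toward the figure above:

# Average width of a 12-wide fetch that cannot cross a 16-instruction
# I-cache line: starting at offset i yields min(12, 16 - i) instructions.
LINE_INSNS, FETCH_WIDTH = 16, 12

widths = [min(FETCH_WIDTH, LINE_INSNS - i) for i in range(LINE_INSNS)]
print(widths)                     # 12 for the first few offsets, tailing off to 1
print(sum(widths) / LINE_INSNS)   # ~7.9 under this uniform-start assumption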
It's also the case that the way these multi-level branch predictors work (at least sometimes, who knows exactly what SS are doing) is that a first fast prediction is made, then over the next cycle or two that prediction is confirmed against the larger prediction data structures. If the later prediction disagrees, you get a flush --- but ideally you can just flush what's in the I-queue before it hits decode, you don't have to flush the entire pipeline. Point is, however, you are now also using up some fraction of that 9-wide fetch (on average) on fetches that get tossed before they even hit decode :-( (a toy sketch of this follows at the end of this comment)
And for this to work well, you want the I-queue to ALWAYS be somewhat fullish, so you want it filling up fast, right after any sort of flush event.
So all things considered, 12-wide fetch is probably optimal, not at all extravagant.
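To put rough numbers on that override cost, here is a toy model; the disagreement rate, override delay and fetch width are assumptions for illustration only, not anything Samsung has disclosed:

# Toy model of an "overriding" two-level predictor: a fast predictor steers
# fetch immediately; a slower, better predictor checks it a couple of cycles
# later and, on disagreement, squashes only the not-yet-decoded I-queue.
import random

CYCLES, AVG_FETCH_WIDTH, OVERRIDE_DELAY = 100_000, 9, 2
FAST_WRONG_RATE = 0.03   # assumed rate at which the slow predictor overrides the fast one

random.seed(0)
fetched = wasted = 0
for _ in range(CYCLES):
    fetched += AVG_FETCH_WIDTH
    if random.random() < FAST_WRONG_RATE:
        # everything fetched during the override window gets tossed before decode
        wasted += AVG_FETCH_WIDTH * OVERRIDE_DELAY

print(f"fetch bandwidth tossed before decode: {wasted / fetched:.1%}")

Even a low single-digit override rate eats a noticeable slice of fetch bandwidth, which is one more reason fetch wants to be wider than sustained decode.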
jospoortvliet - Wednesday, August 29, 2018 - link
Keep commenting, I love this ;-)
eastcoast_pete - Monday, August 20, 2018 - link
Andrei, thanks for the in-depth coverage, especially the added information from your own deep dive from a little while ago! I wonder if Samsung even commented on the software-induced, self-inflicted injury that really hogtied what, by its specs, should have been a faster alternative to the A75/845. Also, did anybody from Samsung thank you for showing them how to partially improve the M3's performance in your deep dive, at least informally/in the hallway? Somehow, the M3 is a very "Samsung" product: okay-to-great hardware, flawed-to-awful software.
All that being said: any mention of or rumors about "Windows on Exynos"?
Andrei Frumusanu - Monday, August 20, 2018 - link
Today's disclosures are just on the µarch - the CPU design teams are not in charge of any of the software, which is S.LSI's and Samsung Mobile's responsibility.
eastcoast_pete - Monday, August 20, 2018 - link
Thanks Andrei, I get that the CPU design teams are not in charge of the software. Still, I imagine that as a member of the CPU design team, I would have had some very unkind words for the software guys (and gals) who made quite a mess and made the CPU look bad. Regarding the apparently pretty strict division between even low-level software and hardware at Samsung: do you think that is part of the problem? Even the best micro-arch can only work as well as the software that runs it allows for. Don't micro-arch + low-level software teams usually work closely together starting at the design stage? How is that handled at Intel, AMD, Qualcomm, Nvidia?
Wardrive86 - Monday, August 20, 2018 - link
The flops you stated are double precision? 12 SP Flops/clock
Wardrive86 - Monday, August 20, 2018 - link
Is there only one 128-bit NEON unit in the M3?
Andrei Frumusanu - Tuesday, August 21, 2018 - link
All of them are 128b. It's single precision Flops.
Wardrive86 - Tuesday, August 21, 2018 - link
Thank you for your response. I suppose I should have asked: are there 3 128-bit (6 64-bit ALU) NEON units? Is the FPU VFPv5?
Wardrive86 - Tuesday, August 21, 2018 - link
Ah, NVM, didn't see the SIMD blocks below the FMAC blocks, my bad. Should be able to vector FMA right up to 24 SP flops/clock in theory (never in actual workloads). What a beast!!
Trifrost - Tuesday, August 21, 2018 - link
NEON is a 128-bit SIMD unit viewed as 2x64-bit ALUs. It looks like 3x64-bit ALUs if you compare to the M1 block diagram. Max 12 flops if that is true.
bobcov - Tuesday, August 21, 2018 - link
This article desperately needs an editor. Could not take it seriously enough to finish reading it. "Productised?" Really? What's next, "seriousity?"
Andrei Frumusanu - Tuesday, August 21, 2018 - link
That's literally the term taken out of the presentation; furthermore:
https://dictionary.cambridge.org/dictionary/englis...
https://en.oxforddictionaries.com/definition/produ...
overzealot - Tuesday, August 21, 2018 - link
Great article, as always. Heavy on the technical aspects, just like we like it.
He's not wrong about the fact that it would benefit from an editor, though. You'd get some easy wins by passing it through a grammar checker if there's no one available to proofread your articles.
Also, if the page used a font where you can differentiate between lowercase L and capital i (l/I), it would make a lot of terms easier to parse.
While I was reading I made a list of text replacements that would improve readability.
The list is way too large for a comment field, so I'm sending it via email.
Dunatis - Tuesday, August 21, 2018 - link
Was this article written by an English speaker, or translated through Google Translate or something? Apart from the other bits, this is truly the worst sentence I have read in a long while: "We’ve exclusively first reported on the details of the new microarchitecture back in January and it was clear from that point on that this was a big one: Samsung had went for a huge push in terms of performance, resulting in one of the biggest generational jumps of any silicon CPU designer in recent history."
darkich - Tuesday, August 21, 2018 - link
I don't see a problem? Looks like you either have basic comprehension issues or English-language understanding issues.
sonofgodfrey - Tuesday, August 21, 2018 - link
So, I don't have the best grammar in the world, but I think this is better:
We first reported on the details of the new microarchitecture exclusively back in January, and it was clear from that point on that this a big one. Samsung had gone for a huge performance push, resulting in one of the biggest generational jumps of any CPU design in recent history.
sonofgodfrey - Tuesday, August 21, 2018 - link
"that this a big one" - That should be: "that this was a big one."
name99 - Wednesday, August 22, 2018 - link
I suspect Andrei speaks rather better German or French than you do.
You might still think you're living in the 1950s, but the rest of us are well aware that it's the 21st century; and it's just basic good manners to be grateful that so many smart people in the world are willing to speak English, rather than complaining that they don't do so perfectly.
abufrejoval - Friday, August 24, 2018 - link
Probably need to add Romanian and Lëtzebuergesch.
darkich - Tuesday, August 21, 2018 - link
"Here’s to hoping S.LSI and SARC resolve the Exynos 9810 and M3’s weaknesses"Can you please clarify in layman terms ..does this mean there is a chance that Note 9's Exynos could be ironed out compared to S9 one you tested?
Btw, have to admit the depth and thoroughness of your articles are pretty mind blowing.
Andrei Frumusanu - Tuesday, August 21, 2018 - link
No, it refers to fixing the inherent hardware limitations in the M4 next year.
jimjamjamie - Tuesday, August 21, 2018 - link
"µarch width and µarch width are complementary to each other for performance"I mean I guess this isn't incorrect, but I don't think that's what you meant to say, Andrei :)
lilmoe - Tuesday, August 21, 2018 - link
The M3 should have just been an iterative improvement over the M2, as the M2 was over the M1. The M3 was rushed in design, execution and process node. So much untapped potential.
Oh well... I guess the M3 was a good beta for optimizing the M4 further, while retaining its competitiveness with Qualcomm and Apple in real-world usage. I have no doubt the M4 will be best in class, but I just can't hide the fact that the M3 left a bitter taste in my mouth; it should have been a clear win in all aspects.
eastcoast_pete - Tuesday, August 21, 2018 - link
Knowing that this might rub some here the wrong way: while I agree that the style and grammar of Andrei's article could have been improved by an editor or proofreader, I mainly read articles here at Anandtech for their content, not style or grammar. As far as I can tell, Andrei's article was not littered with wrong or confusing statements or facts; on the contrary, it was an (obviously) quickly written, yet accurate summary of a very recent presentation at Hot Chips with some helpful background added.
So, while I agree that a quick proofreading, preferably by another pair of eyes, would have likely avoided some issues, I for one prefer high-quality, accurate breaking news in occasionally imperfect English over perfect prose with missing or wrong facts that arrives weeks late.
So, can we get back to the technology, which is what brings us here?
Speedfriend - Wednesday, August 22, 2018 - link
Exactly, if you don't like the grammar or style, go somewhere else... and stop being pedantic.
jospoortvliet - Wednesday, August 29, 2018 - link
Agreed. His articles are among the very best in combining technical detail with clarity, and that is all I care about.
abufrejoval - Friday, August 24, 2018 - link
As you hint, the power consumption disadvantages might not be as significant in a bigger form factor, such as a Slimbook (can't say Ultrabook, right?) or a tablet. Even the Kirin 950 (or was it the 960?), which burned way too much power in a phone, seems adequate enough in tablets, where the larger displays tend to offset what the SoC eats in batteries. I
abufrejoval - Friday, August 24, 2018 - link
wonder if they'd sell surplus chips cheap :-)
want edit!