29 Comments

  • Eden-K121D - Thursday, August 25, 2016 - link

    So when are we getting a deep dive on the SD 820 and Exynos 8890?
  • hans_ober - Thursday, August 25, 2016 - link

    The power analysis with the Intel x86 SoC, Tegra 3 and Snapdragon from some years back needs to be repeated..
  • ddriver - Thursday, August 25, 2016 - link

    Sure, because it must be rehashed "how superior" Atom is to ARM, which would explain Intel's complete and utter domination of the mobile device market.... oh wait, that's right, that's not a thing.

    Wonder about the gaps in the last chart, did they censor some stuff too good to know about?

    On a side note, while performance is meh, efficiency seems pretty good, they should make a 16 core version of it!
  • frenchy_2001 - Thursday, August 25, 2016 - link

    The gaps are obviously for legibility, as they are between individual tests, type and overall.
    They seem to be a bit more powerful than A57 (about 10%) overall, while being about 25% more efficient. Would be nice to compare with A72, as A57 was rather power hungry.
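The two percentages above also imply a power figure. A minimal sketch of the arithmetic, assuming "efficiency" means performance per watt (the article's chart may define it differently); the function name is made up for illustration:

```python
def relative_power(perf_ratio: float, efficiency_ratio: float) -> float:
    """Power of chip B relative to chip A, given the B/A ratios of
    performance and of efficiency.

    Assumption: efficiency = performance / power, so
    power = performance / efficiency.
    """
    return perf_ratio / efficiency_ratio


# Figures quoted in the comment: roughly 10% faster and 25% more efficient than A57.
m1_vs_a57 = relative_power(1.10, 1.25)
print(f"Implied M1 power relative to A57: {m1_vs_a57:.2f}x")
```

Under that assumption the M1 would draw roughly 12% less power than the A57 for the same workload mix, which is only as good as the two rounded inputs.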
  • Ariknowsbest - Friday, August 26, 2016 - link

    After two weeks with my new phone the A72 (Kirin 950) seems very efficient, compared to the S808 (work phone) or S801 (own). Wasn't impressed by the A57 so I stayed longer with 32-bit on Android.
  • Gasaraki88 - Wednesday, August 31, 2016 - link

    A lot of products are superior to others yet don't do well in the market for various reasons. Just because Atom is dead doesn't mean it wasn't a good chip. (Beta vs. VHS, etc.)
  • RaduR - Thursday, August 25, 2016 - link

    If only they would start using PowerVR instead of Mali, then we could have real competition for Qualcomm's SD. Unfortunately, even the midrange SD 650 has faster graphics than Mali.
    A CPU is not enough; they need a strong GPU, and PowerVR looks like the only viable option to compete with Qualcomm.
  • ZenX - Thursday, August 25, 2016 - link

    Exynos 8890 could've matched SD820 in graphics if they had used the top-of-the-line Mali T880-MP16, but instead Samsung decided to shelve 4 cores and ended up with MP12.
  • mczak - Thursday, August 25, 2016 - link

    The Mali T880-MP12 in the Exynos 8890 already has to throttle like crazy under any kind of sustained load (to about half the peak performance). It is true that wider but lower-clocked designs are generally more power efficient, but compared to other chips using Mali T8xx graphics the max graphics clock is already pretty low, so it seems unlikely efficiency would be better with lower clocks. Thus, all an MP16 version would achieve would be higher (useless) peak performance.
    IMHO Samsung could have just used a MP8 version instead with very little loss of actual real world performance (but the same is true for the Adreno 530 too, a somewhat smaller configuration would achieve the same practical performance by not needing to throttle that much).
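The argument above can be put into a toy model. This is only a sketch under loud assumptions: dynamic power scales as cores × f × V², leakage is ignored, and the voltage floor, maximum clock and thermal budget are invented numbers, not measurements of any real Mali configuration.

```python
V_MIN = 0.7    # hypothetical minimum stable voltage (V)
F_MAX = 0.65   # hypothetical maximum GPU clock (GHz), already on the voltage floor
BUDGET = 3.0   # hypothetical sustained power budget (arbitrary units)


def voltage(freq_ghz: float) -> float:
    """Hypothetical V/f curve: flat at V_MIN, rising linearly above 0.7 GHz."""
    return max(V_MIN, 1.0 * freq_ghz)


def power(cores: int, freq_ghz: float) -> float:
    """Dynamic power in arbitrary units, proportional to cores * f * V^2."""
    return cores * freq_ghz * voltage(freq_ghz) ** 2


def sustained_freq(cores: int, budget: float = BUDGET) -> float:
    """Highest clock (capped at F_MAX) whose power fits within the budget."""
    if power(cores, F_MAX) <= budget:
        return F_MAX
    lo, hi = 0.0, F_MAX
    for _ in range(60):  # bisection; power() rises monotonically with frequency
        mid = (lo + hi) / 2
        if power(cores, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo


for cores in (8, 12, 16):
    f = sustained_freq(cores)
    # Throughput is modelled simply as cores * clock.
    print(f"MP{cores}: peak {cores * F_MAX:.2f}, sustained {cores * f:.2f}")
```

With these made-up numbers the MP12 and MP16 configurations sustain the same throughput once the budget binds, because below the voltage floor power is simply proportional to cores × clock. Only the peak figure grows with core count, and the MP8 gives up comparatively little.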
  • lilmoe - Thursday, August 25, 2016 - link

    Stop looking too much into benchmarks. Mali has been faster and more efficient for a while in real world usage.
  • osxandwindows - Thursday, August 25, 2016 - link

    not.
  • osxandwindows - Thursday, August 25, 2016 - link

    In real world usage, the PowerVR GT7600 is the fastest, followed by the QC…
  • MrSpadge - Thursday, August 25, 2016 - link

    Which application of general relevance are you referring to, which requires an "ultra high end" GPU in a smartphone instead of a "high end" GPU?
  • mercucu1111 - Thursday, August 25, 2016 - link

    Uh... did you say Adreno 510 is better than Mali-T880 MP12?

    Here is the data.

    https://gfxbench.com/device.jsp?D=Samsung+Galaxy+A...

    https://gfxbench.com/device.jsp?D=Samsung+Galaxy+N...

    In offscreen tests, T880 is about 3 times better than A510(SD650, 652).

    http://images.anandtech.com/graphs/graph10559/8337...

    Here is some more data. In the Basemark offscreen test, T880 MP12 scores higher than A530 (SD820).
  • Lochheart - Thursday, August 25, 2016 - link

    Come on... It's a benchmark, it's not relevant in real world usage.
    Do a GameBench run and see how it doesn't work like that.
  • darkich - Friday, August 26, 2016 - link

    Mali T880 MP 12 is more efficient than Adreno 530. Lower peak performance, but better sustained performance and lower power consumption.
  • Infy2 - Thursday, August 25, 2016 - link

    Me wants a comparison between Kryo, A72 and M1. Something like performance at the same clock speeds, perf/W and sustained performance.
  • Meteor2 - Friday, August 26, 2016 - link

    Me too! Let's get to the heart of it. Of course, sustained performance is chassis-dependent...
  • markiz - Friday, August 26, 2016 - link

    Why would you want that, if those are not configurations that you can find in a product you can actually buy and use?

    I think tests AS IS are far more useful.
  • Meteor2 - Friday, August 26, 2016 - link

    Because they're interesting, and illustrate the state-of-the-art?
  • jayfang - Thursday, August 25, 2016 - link

    So if I'm reading this (tricky) graph right, overall an M1 uses more power than the A57 (110%) but is more efficient (120%). That puts it in a bracket of about the same perf, but worse power than A72. Also, no talk of die size?

    How much did this effort cost them, because so far the ROI is questionable.
  • Meteor2 - Friday, August 26, 2016 - link

    Indeed. But it's a work in progress -- they must feel that there's more to come.
  • name99 - Thursday, August 25, 2016 - link

    " I’m no expert here but it looks like this branch predictor also has the ability to do multiple branch predictions at the same time, either as a sort of multi-level branch predictor or handling multiple successive branches. Perceptron branch prediction isn't exactly new in academia or in real-world CPUs, but it's interesting to see that this is specifically called out when most companies are reluctant to disclose such matters."

    The usual situation with branch prediction (on modern high-end CPUs) is that the fetch engine predicts
    - the next address of interest AND
    - how far along from that next address to load instructions.
    Instruction load ends at the smallest of
    - the end of the cache line OR
    - the maximum width of the bus from the I-cache into the I-queue (which Samsung tells us is 6 instructions) OR
    - the next branch point (which is known because those instructions were tagged as such when the line was first pulled into the I-cache)

    NOW suppose that the third case holds AND that the predictor predicts that this next branch point is NOT taken. Then it is obviously reasonable to extend this fetch through that branch and on till either the first two conditions hold, or yet another branch point is encountered.

    This is the simplest way to extend branch prediction to utilize two predictions per cycle, and is certainly good enough for a two-wide machine. I am sure that's what Samsung is doing.
    (IBM does this already with POWER8 if not earlier. My guess is that Apple also already do it. It makes more sense to do this the wider your CPU becomes.)
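The fetch rule described above can be sketched in a few lines. This is a toy model of the general technique, not Samsung's actual logic; the function name, the slot-index representation and the `max_predictions` parameter are all assumptions made for illustration.

```python
from typing import Callable, Set


def fetch_group_length(
    start: int,
    line_len: int,
    width: int,
    branch_slots: Set[int],
    predicts_taken: Callable[[int], bool],
    max_predictions: int = 2,
) -> int:
    """Number of instructions fetched this cycle, starting at slot `start`
    of a cache line holding `line_len` instruction slots.

    Fetch ends at the smallest of: the end of the cache line, the fetch
    width, or a branch that is predicted taken. A branch predicted NOT
    taken is fetched through, as long as a prediction is still available.
    """
    limit = min(line_len, start + width)  # exclusive upper bound on slots
    used = 0
    for slot in range(start, limit):
        if slot in branch_slots:
            used += 1
            # Stop after a taken branch, or when predictions run out.
            if predicts_taken(slot) or used == max_predictions:
                return slot - start + 1
    return limit - start


# 16-slot line, 6-wide fetch, branch at slot 4 predicted not taken:
print(fetch_group_length(2, 16, 6, {4}, lambda slot: False))
```

Setting `max_predictions=1` gives the simple case where fetch always ends at the first branch; the default of 2 models the "two predictions per cycle" extension.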

    As for perceptrons:
    - It's not clear that they are the absolute best possible. The last time a competition was held against various predictors (given a fixed transistor budget, etc) the winner was the TAGE predictor. I've seen various vague hints that current Intel uses TAGE-like predictors, and A9 has branch accuracy comparable to Intel, suggesting Apple also use it.
    https://hal.inria.fr/hal-00639193/document

    Samsung is not alone here. Zen also uses Perceptron, and it will be interesting once it is released to see how it compares with Intel on interpreters and similar code that subjects predictors to extreme stress.
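For readers who have not met perceptron predictors, here is a minimal sketch of the textbook scheme (one weight vector per branch, dotted with a global history of ±1 outcomes). It is not a description of what Samsung or AMD actually built: the table size, history length and unclipped integer weights are simplifying assumptions, and the threshold formula is the one commonly quoted from the academic literature.

```python
class PerceptronPredictor:
    """Textbook-style perceptron branch predictor (illustrative only)."""

    def __init__(self, history_len: int = 8, table_size: int = 64):
        self.history = [-1] * history_len  # +1 = taken, -1 = not taken
        self.weights = [[0] * (history_len + 1) for _ in range(table_size)]
        # Training threshold commonly quoted for this scheme.
        self.theta = int(1.93 * history_len + 14)

    def _output(self, pc: int) -> int:
        w = self.weights[pc % len(self.weights)]
        # w[0] is the bias; the rest are dotted with the global history.
        return w[0] + sum(wi * hi for wi, hi in zip(w[1:], self.history))

    def predict(self, pc: int) -> bool:
        return self._output(pc) >= 0

    def update(self, pc: int, taken: bool) -> None:
        y = self._output(pc)
        t = 1 if taken else -1
        # Train on a misprediction or when confidence is low.
        if (y >= 0) != taken or abs(y) <= self.theta:
            w = self.weights[pc % len(self.weights)]
            w[0] += t
            for i, h in enumerate(self.history):
                w[i + 1] += t * h  # real hardware would saturate these
        self.history = [t] + self.history[:-1]


def count_correct(predictor: PerceptronPredictor, pc: int, outcomes) -> int:
    """Predict then train on each outcome; return how many were right."""
    correct = 0
    for taken in outcomes:
        correct += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return correct


p = PerceptronPredictor()
pattern = [i % 2 == 0 for i in range(200)]  # taken, not taken, taken, ...
count_correct(p, 0x40, pattern[:100])       # warm-up
print(count_correct(p, 0x40, pattern[100:]), "of 100 correct after warm-up")
```

An alternating branch is linearly separable in the history bits, so it suits this kind of predictor. Patterns that are not linearly separable are where table-based schemes such as TAGE tend to do better.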
  • Eden-K121D - Friday, August 26, 2016 - link

    A hybrid approach would have been much better
  • name99 - Thursday, August 25, 2016 - link

    More interesting is what is omitted.
    There appears not to be a decoded-operations cache, just a loop buffer. Such a cache does not increase performance directly, but does allow for substantial (maybe 20%) power reduction.

    I'm also guessing they're not doing memory speculation (ie executing loads even when there are prior stores with unresolved addresses). Such speculation is surprisingly worthwhile in terms of performance boost, but requires extra machinery to recover when the speculation goes wrong. Intel added it a while ago, and we know Apple does it based on a lawsuit by UW-Madison (which also sued Intel but settled before trial --- presumably when Samsung adds this they'll first pay UW-Madison, but it's Samsung so who knows :-) ).
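To make "memory speculation" concrete, here is a deliberately tiny sketch of the idea: issue the load before older store addresses are known, then check and replay if one of them turns out to alias. The data structures and the function name are hypothetical, and real hardware does this with load/store queues and dependence predictors rather than a Python loop.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Store:
    addr: int   # only known once the store's address has resolved
    value: int


@dataclass
class Load:
    addr: int


def speculative_load(
    memory: Dict[int, int], older_stores: List[Store], load: Load
) -> Tuple[int, bool]:
    """Return (value, replayed) for a load issued ahead of older stores.

    Step 1: speculate that no older store aliases the load and read memory.
    Step 2: once store addresses resolve, check for a conflict; on a hit,
            discard the speculative value and replay with the store's data.
    """
    value = memory.get(load.addr, 0)   # speculative read
    replayed = False
    for store in older_stores:         # program order, oldest first
        if store.addr == load.addr:    # misspeculation detected
            value = store.value        # youngest matching store wins
            replayed = True
    return value, replayed


memory = {0x10: 1, 0x20: 2}
print(speculative_load(memory, [Store(0x30, 9)], Load(0x10)))  # no alias
print(speculative_load(memory, [Store(0x10, 7)], Load(0x10)))  # alias, replay
```

The payoff is that the common no-alias case costs nothing extra, while the rare alias case pays for a replay. That recovery path is the "extra machinery" mentioned above.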

    And as always, more and more of the performance is driven by details that are not captured in the numbers that are released. What's the quality of the prefetchers, and the branch predictors? What's the quality of the memory controller? What algorithms are used to decide on which cache lines to replace? etc etc

    In terms of sophistication it looks to me just slightly beyond Apple's Swift (though obviously running at higher frequency). It obviously is 4 wide rather than 3 wide, and adds the 64-bit ISA, but those are fairly mechanical additions.
    That's not a criticism --- everyone has to start somewhere --- but I think it calibrates the extent of the achievement, and the extent of the gap remaining. What will be interesting will be to see whether they add sophistication as fast as Apple did. (Most obviously: clustering of two 3-wide units in Cyclone, fixing up random weak spots in branch prediction, the multipliers, the FPUs, the caches in Typhoon, dramatically improved memory controller in Twister).
  • MrCommunistGen - Thursday, August 25, 2016 - link

    name99: I really appreciate your analysis.

    For those of us who aren't completely immersed in the intricacies of mobile CPU architecture the additional insight is really interesting. We've all seen how M1 performs compared to A72 and S820. Seeing block diagrams and a description of M1 (in the original article) is interesting, but without the added context of how it compares to other architectures this interest was - at least to me - more academic.

    You've added to the topic and in doing so I feel you've bettered my understanding of the subject.
  • saratoga4 - Thursday, August 25, 2016 - link

    > which can do a floating point multiply-accumulate operation every 5 cycles

    Are you sure that is every 5 cycles and not 5 cycle latency? I think the FP unit would be pipelined.
  • Andrei Frumusanu - Thursday, August 25, 2016 - link

    Correct, it's 1 MAC per cycle with 5 cycle latency.
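The difference between the two readings is easy to quantify. A small sketch, assuming an idealized pipeline with no stalls; the function names are made up for illustration:

```python
def cycles_independent(n_ops: int, latency: int = 5, issue_interval: int = 1) -> int:
    """Cycles to finish n independent MACs on a unit that accepts a new
    operation every `issue_interval` cycles and takes `latency` cycles each."""
    return latency + (n_ops - 1) * issue_interval


def cycles_dependent(n_ops: int, latency: int = 5) -> int:
    """Cycles for a chain where each MAC needs the previous result."""
    return n_ops * latency


n = 100
print("pipelined, independent ops:", cycles_independent(n))
print("unpipelined (one MAC per 5 cycles):", cycles_independent(n, issue_interval=5))
print("pipelined, dependent chain:", cycles_dependent(n))
```

With independent operations the pipelined unit approaches one result per cycle. The 5-cycle latency only dominates when each MAC depends on the one before it, as in a naive accumulation loop.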
  • darkich - Friday, August 26, 2016 - link

    I have a question for Joshua and Andrei, not keeping high hopes I'll get an answer but still.. Which design do you like better, the M1 or Cortex A73?
