
  • xinthius - Sunday, June 2, 2013 - link

    It is a shame that Swift isn't included.
  • shompa - Sunday, June 2, 2013 - link

    Swift won't be licensed or used in non-Apple hardware, so even if Swift were a million times faster, it wouldn't help any of us.
  • melgross - Monday, June 3, 2013 - link

    Except for the large percentage of us here who do use Apple hardware.
  • codedivine - Sunday, June 2, 2013 - link

    Author here. Unfortunately I have neither an iPhone nor a Mac to develop the app on, so I couldn't test it. I might look at Swift in the future.
  • bersl2 - Sunday, June 2, 2013 - link

    Can you imagine if Intel had never openly published the x86 instruction set architecture and rarely talked about its microarchitecture?

    Now consider the fact that extreme secrecy as to how to best interact with a big chip or with chipsets is the norm in the hardware world. It's horrible, and computing would be so much better if hardware companies actually talked with software developers openly. Because frankly, they suck at software, including firmware.
  • dishayu - Sunday, June 2, 2013 - link

    Wow, Krait 400 and A15 are really quite close... No wonder the two GS4 variants (1.6 GHz A15 vs 1.9 GHz Krait 400) have similar performance.
  • dishayu - Sunday, June 2, 2013 - link

    I meant Krait *300.

    Krait 400 comes with the Snapdragon 800 processors, of course.
  • phoenix_rizzen - Sunday, June 2, 2013 - link

    Yeah, looks like Krait 300+ hit Qualcomm's targets of "nearly the performance of an A15 at the power levels of an A9" (give or take a bit). I'm very impressed by the S4 Pro SoC in my Optimus G.
  • ifIhateOnAppleCanIbeCoolToo - Sunday, June 2, 2013 - link

    Krait 300s have very limited memory read/write performance, which makes them firmly last-gen in non-synthetic benchmarks compared to A15s.
  • npp - Sunday, June 2, 2013 - link

    Very nice article; I would love to see more like this one. I really feel AnandTech should maintain its focus on low-level architecture details alongside the more consumer-oriented reviews.
  • codedivine - Sunday, June 2, 2013 - link

    Author here. Thanks for your kind words :)
  • tipoo - Sunday, June 2, 2013 - link

    Thanks for this; I find it very interesting. The floating point performance of ARM chips is now very relevant, since so many games are starting to run on ARM platforms and floating point is the predominant type of math done in games (vs. integer).

    I'd be curious to see where a Jaguar core would fall in this (to estimate the Xbox One and PS4), as well as a PowerPC 750 (Wii U), although the latter would be harder to find. ARM cores seem to be closing in on the performance of the low-end x86 cores; even if Jaguar is still quite a ways ahead, I wonder how different the FP performance is.
  • codedivine - Sunday, June 2, 2013 - link

    Author here. Jaguar throughput is discussed in the article discussion. Summary: 3 fp64 flops/cycle, 8 fp32 flops/cycle.
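
    (For scale, assuming a 1.6 GHz clock, which is typical for Jaguar parts but not universal, that works out to roughly 8 × 1.6 ≈ 12.8 fp32 GFLOPS and 3 × 1.6 ≈ 4.8 fp64 GFLOPS per core.)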
  • Wilco1 - Sunday, June 2, 2013 - link

    Here are the Geekbench results of Jaguar vs A15: http://browser.primatelabs.com/geekbench2/compare/...

    On FP A15 wins by a good margin. On integer Jaguar is slightly faster.
  • tipoo - Sunday, June 2, 2013 - link

    That's unexpected. I would have thought the Jaguar would lead in almost every situation, being higher power.
  • Wilco1 - Monday, June 3, 2013 - link

    Remember the A15 is 3-way OoO, supports 1 load and 1 store per cycle, and has very wide issue, so it can easily leave Jaguar behind on compute-intensive code, as the results show. However, Jaguar wins on memory-intensive code due to its larger L2 and faster memory system.
  • aliasfox - Monday, June 3, 2013 - link

    If historical Mac G3 benchmarks are anything to go by, I don't think the PPC 750 will be much faster at floating point than the best of ARM.

    Apple used the PPC 750 and called it the G3 back in the day. New ones are higher clocked, more power efficient, and may have more/faster cache, but should be fundamentally the same. Assuming this, one should be able to extrapolate synthetic benches by scaling cores and frequency, no?
  • DanNeely - Sunday, June 2, 2013 - link

    Where does Atom stand in the mix? I think it would be a useful data point, since Intel is positioning the Atom against ARM-based systems.
  • Wilco1 - Monday, June 3, 2013 - link

    IIRC Atom has similar peak FP capabilities to the Cortex-A9; however, actual performance is far lower. E.g. a 1.4 GHz Cortex-A9 wins most single-threaded FP benchmarks against a 2 GHz Z2480: http://browser.primatelabs.com/geekbench2/compare/...

    This also shows how far behind Atom is compared with last-generation phones. Intel needs Silvermont desperately to try to close the gap.
  • watersb - Sunday, June 2, 2013 - link

    Excellent work!

    I wonder if GPU-based floating point will see more rapid adoption in the mobile space.
  • oc3an - Sunday, June 2, 2013 - link

    How did you account for time spent not running your benchmark, i.e. when the OS is servicing interrupts or switched to a different task?
  • codedivine - Sunday, June 2, 2013 - link

    Well, those are difficult to measure, but I do not think they were significant issues in this test.
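
    For what it's worth, here is a minimal sketch of one way to sanity-check that on Android/Linux: compare wall-clock time against the thread's CPU time using POSIX clocks (the dummy run_kernel below is only a stand-in, not the actual benchmark kernel from the article):

        /* Hedged sketch: how much of the wall-clock time did this thread
         * actually spend on a CPU? Uses POSIX clocks available in the NDK. */
        #include <stdio.h>
        #include <time.h>

        static double to_sec(struct timespec t) {
            return t.tv_sec + t.tv_nsec * 1e-9;
        }

        static volatile float sink;

        static void run_kernel(void) {
            /* Stand-in for the real FP kernel: a simple dependent FP loop. */
            float x = 1.0f;
            for (long i = 0; i < 100000000L; ++i)
                x = x * 0.999999f + 0.5f;
            sink = x;
        }

        int main(void) {
            struct timespec w0, w1, c0, c1;
            clock_gettime(CLOCK_MONOTONIC, &w0);
            clock_gettime(CLOCK_THREAD_CPUTIME_ID, &c0);
            run_kernel();
            clock_gettime(CLOCK_MONOTONIC, &w1);
            clock_gettime(CLOCK_THREAD_CPUTIME_ID, &c1);
            double wall = to_sec(w1) - to_sec(w0);
            double cpu  = to_sec(c1) - to_sec(c0);
            /* A ratio near 1.0 means interrupts and task switches stole
             * very little time from the benchmark thread. */
            printf("wall %.3f s, thread CPU %.3f s (%.1f%%)\n",
                   wall, cpu, 100.0 * cpu / wall);
            return 0;
        }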
  • phoenix_rizzen - Sunday, June 2, 2013 - link

    If you are running your app via Android, consider installing Diagnosis Pro. It will allow you to add an overlay that shows you the exact frequency of each individual core, polled every X seconds. Alternatively, it can just log the data to its internal database for later export.

    Works quite nicely on an Optimus G (quad-core Snapdragon S4 Pro SoC).

    I've been using it to test how well different CPU governors and hotplug CPU drivers work.
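
    If you would rather not install anything, the same numbers can be read straight from sysfs via the standard Linux cpufreq interface; a minimal sketch, assuming a 4-core device:

        /* Hedged sketch: poll each core's current frequency from sysfs.
         * The core count (4) is an assumption; offline (hot-unplugged)
         * cores simply have no readable cpufreq node. */
        #include <stdio.h>

        int main(void) {
            char path[128];
            for (int cpu = 0; cpu < 4; ++cpu) {
                snprintf(path, sizeof path,
                         "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_cur_freq",
                         cpu);
                FILE *f = fopen(path, "r");
                if (!f) {
                    printf("cpu%d: offline\n", cpu);
                    continue;
                }
                long khz = 0;
                if (fscanf(f, "%ld", &khz) == 1)
                    printf("cpu%d: %ld MHz\n", cpu, khz / 1000);
                fclose(f);
            }
            return 0;
        }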
  • codedivine - Sunday, June 2, 2013 - link

    Thanks for the tip! I will look into it!
  • ChronoReverse - Tuesday, June 4, 2013 - link

    Yeah, the thermal throttling on the Krait devices is very aggressive (I'm currently using a hack on my GS4 to stop it because I like benchmarks).

    Overhead on Android is also pretty high. With your RgbenchMM, the difference on my GS4 is 3000 vs 3400 if I go ahead and kill tasks first.
  • whyso - Sunday, June 2, 2013 - link

    It would really be nice to see
    1) Jaguar
    2) Atom
    3) Ivy Bridge
    in the mix. (Though of course the test would have to be coded differently.)
  • codedivine - Sunday, June 2, 2013 - link

    One needs to be careful when comparing instruction throughput across ISAs, because instructions on different ISAs are not equivalent. However, I am certainly looking into it.
  • Marat Dukhan - Sunday, June 2, 2013 - link

    This is not an artificial benchmark, but it gets close to peak: https://twitter.com/YepppLibrary/status/3414033653...
  • nakul02 - Sunday, June 2, 2013 - link

    Check out this paper:
    Gaurav Mitra, Beau Johnston, Alistair P. Rendell, Eric McCreath, Jun Zhou. Use of SIMD Vector Operations to Accelerate Application Code Performance on Low-Powered ARM and Intel Platforms, Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2013 IEEE 27th International. IEEE, 2013.
  • ZeDestructor - Monday, June 3, 2013 - link

    They do try to (IMO), but there are only so many architecture launches every generation, so you kind of have to do the more consumer-focused stuff to fill in the gap.
  • skiboysteve - Monday, June 3, 2013 - link

    My work is going to be using the Cortex-A9 for a project soon, and that team is deciding between NEON and VFPv3. Can you comment on the precision and performance tradeoffs?

    Thanks for the great article!
  • Wilco1 - Monday, June 3, 2013 - link

    NEON supports 32-bit float only, but with NEON the A9 can do 2 FMACs per cycle rather than 1 with VFP. There is no tradeoff in precision if your code already uses 32-bit floats (NEON flushes denormals to zero by default; with VFP you can choose - either way, it doesn't affect any real code).
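
    To make that concrete, here is a minimal sketch of the two paths (assuming GCC/Clang with arm_neon.h, building the NEON version with -mfpu=neon; the function names are just illustrative):

        /* Hedged sketch: a 4-wide fp32 multiply-accumulate via NEON
         * intrinsics vs. a plain scalar loop (which goes through VFP if
         * auto-vectorization is off). n is assumed to be a multiple of 4. */
        #include <arm_neon.h>

        void mac_neon(float *acc, const float *a, const float *b, int n) {
            for (int i = 0; i < n; i += 4) {
                float32x4_t va = vld1q_f32(a + i);
                float32x4_t vb = vld1q_f32(b + i);
                float32x4_t vc = vld1q_f32(acc + i);
                vc = vmlaq_f32(vc, va, vb);   /* acc += a * b, four lanes at once */
                vst1q_f32(acc + i, vc);
            }
        }

        void mac_vfp(float *acc, const float *a, const float *b, int n) {
            for (int i = 0; i < n; ++i)
                acc[i] += a[i] * b[i];        /* one scalar MAC per element */
        }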
  • eiriklf - Monday, June 3, 2013 - link

    Is there any chance of seeing the scores from a third Krait 200 device, for instance a Krait-based One X, GS3 or Optimus G? I know all of those devices have about 3x the performance of the Nexus 4 in Linpack per core, so I would love to know if you found a difference with your script.
  • srihari - Monday, June 3, 2013 - link

    Can you compare with Intel? I understand you have NEON instructions in your test, but x86 vs ARM would be a good comparison.
  • srihari - Monday, June 3, 2013 - link

    Performance is not the only criterion to compare. I would conclude that Krait 300 clearly leads considering performance + power.
  • banvetor - Wednesday, June 5, 2013 - link

    Great article, thanks for the work. Looking forward to more in the series... :)
  • Parhelion69 - Wednesday, June 5, 2013 - link

    Could you update this article with numbers from the Exynos 5 Octa in the SGS IV?

    I've run some benchmarks and its A15 cluster seems like quite a beast:
    AnTuTu: 28086, CPU floating point: 5923
    JavaScript:
    SunSpider: 652 ms
    Kraken: 6392 ms
    RIABench Focus: 1468 ms

    I don't have Geekbench but found these numbers:
    http://browser.primatelabs.com/geekbench2/2014946
    Geekbench score: 3598, floating point: 6168
  • Arkantus - Wednesday, June 19, 2013 - link

    Hello, just a dumb question: the article says "I count NEON addition and multiply as four flops and NEON MACs are counted as eight flops", and the A9 Add (fp32 NEON) is rated at 1/2 flop/cycle.
    So does this mean that the Add (fp32 NEON) is slower than its VFP counterpart, since each cycle the NEON version only performs half an operation according to this table?
    Thanks
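
    (For reference, the counting convention quoted above is per instruction: a quad fp32 NEON add operates on four lanes, hence 4 × 1 = 4 flops, while a quad fp32 NEON MAC does a multiply plus an add in each of four lanes, hence 4 × 2 = 8 flops.)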
  • [email protected] - Friday, June 27, 2014 - link

    Hey this is good stuff. Can anybody here help explain something for me though?

    I'm a database apps and integration guy, not formally trained and just starting to get interested in this kind of low-level stuff. I've just been reading up on DMIPS and wondering how they relate to flops.

    What I think I know so far:
    A flop is a floating point calculation.
    The "ip" in "MIPS" is an instruction, so it's a broader term (is a flop a type of ip, or does it take two ips to make a flop drop?)
    Instructions per second is about the rawest, most non-contextualised metric of computing power you can get. Flops are a close second.

    Squeezing more instructions out of a single CPU cycle is the hard problem. There aren't massive variances in what can be done in this regard. The Krait 300 manages about 3.3 instructions per cycle, which on 4 cores at 1.7 GHz works out to about 22 GIPS (semi-source: http://investorshub.advfn.com/boards/read_msg.aspx...)

    My question is, firstly, why are GPUs seemingly never measured in DMIPS and CPUs rarely in flops?
    Secondly, would knowing the answer to the "firstly" explain why, despite no huge variance in DMIPS/MHz across different devices, the top GPUs manage 1000x faster performance, measured in flops, than these ARM chips? They get tera- not gigaflops whilst using a similar number of cores and a lower frequency.

    Obviously they consume a tonne more power to do it, so I know it's not something for nothing, but what's the heart of that something when it comes to how much you can do in a cycle?

    Ah, it's just occurred to me. Is it that an "instruction" refers to an item in a linear thread, but just one of them on a GPU might include setting the RGB values for all pixels in a frame at once? That'd be a few million flops in parallel for one instruction?

    Hmmm, well, the real-world numbers don't add up for that, but is that along the right lines? If so, why are these gigaflop numbers lower than giga-MIPS?

    Sorry it was a long one. It can be very hard to find an intermediate starting point when you google an advanced subject.
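
    (To tie the numbers in the comment above together: 3.3 instructions/cycle × 4 cores × 1.7 GHz ≈ 22.4 × 10^9 instructions per second, i.e. about 22 GIPS, and peak FLOPS follows the same shape of formula: flops per cycle per core × cores × clock. GPUs reach teraflops mainly because a single GPU instruction is issued across tens or hundreds of SIMD lanes at once, so their flops-per-cycle term is vastly larger even at lower clocks.)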
