39 Comments
xinthius - Sunday, June 2, 2013 - link
It is a shame that Swift isn't included.
shompa - Sunday, June 2, 2013 - link
Swift won't be licensed/used in non-Apple hardware. So even if Swift was 1000000x faster, it would not help any of us.
melgross - Monday, June 3, 2013 - link
Except for the large percentage of us here who do use Apple hardware.
codedivine - Sunday, June 2, 2013 - link
Author here. Unfortunately I don't have an iPhone, nor a Mac to be able to develop the app for the iPhone. So I couldn't test it. Might look at Swift in the future.
bersl2 - Sunday, June 2, 2013 - link
Can you imagine if Intel had never openly published the x86 instruction set architecture and rarely talked about its microarchitecture?
Now consider the fact that extreme secrecy as to how to best interact with a big chip or with chipsets is the norm in the hardware world. It's horrible, and computing would be so much better if hardware companies actually talked with software developers openly. Because frankly, they suck at software, including firmware.
dishayu - Sunday, June 2, 2013 - link
Wow, Krait 400 and A15 are really quite close... No wonder the 2 GS4 variants (1.6GHz A15 vs 1.9GHz Krait 400) have similar performance.
dishayu - Sunday, June 2, 2013 - link
I meant Krait *300. Krait 400 comes with the Snapdragon 800 processors, of course.
phoenix_rizzen - Sunday, June 2, 2013 - link
Yeah, looks like Krait 300+ hit Qualcomm's targets of "nearly the performance of an A15 at the power levels of an A9" (give or take a bit). I'm very impressed by the S4 Pro SoC in my Optimus G.
ifIhateOnAppleCanIbeCoolToo - Sunday, June 2, 2013 - link
Krait 300's had very limited memory read/write performance that makes them firmly last-gen in non-synthetic benchmarks compared to A15's.
npp - Sunday, June 2, 2013 - link
Very nice article, would love to see more like this one. I really feel Anandtech should maintain its focus on low-level architecture details alongside the more consumer-oriented reviews.
codedivine - Sunday, June 2, 2013 - link
Author here. Thanks for your kind words :)
tipoo - Sunday, June 2, 2013 - link
Thanks for this, I find this very interesting as the floating point performance of ARM chips is now very relevant since so many games are starting to run on ARM platforms, and floating point is the predominant type of math done in games (vs integer).
I'd be curious to see where a Jaguar core would fall in this (to estimate the XBone and PS4), as well as a PowerPC 750 (Wii U), although the latter would be harder to find. ARM cores seem to be closing in on the performance of the low-end x86 cores, even if Jaguar is still quite a ways ahead; I wonder how different the FP performance is.
codedivine - Sunday, June 2, 2013 - link
Author here. Jaguar throughput is discussed in the article discussion. Summary: 3 fp64 flops/cycle, 8 fp32 flops/cycle.
Wilco1 - Sunday, June 2, 2013 - link
Here are the Geekbench results of Jaguar vs A15: http://browser.primatelabs.com/geekbench2/compare/...
On FP A15 wins by a good margin. On integer Jaguar is slightly faster.
tipoo - Sunday, June 2, 2013 - link
That's unexpected. I would have thought the Jaguar would lead in almost every situation, being higher power.
Wilco1 - Monday, June 3, 2013 - link
Remember A15 is 3-way OoO, supports 1 load and 1 store per cycle and has very wide issue, so it can easily leave Jaguar behind on compute-intensive code, as the results show. However, Jaguar wins on memory-intensive code due to its larger L2 and faster memory system.
aliasfox - Monday, June 3, 2013 - link
If historical Mac G3 benchmarks are anything to go by, I don't think the PPC 750 will be much faster at floating point than the best of ARM.
Apple used the PPC750 and called it the G3 back in the day. New ones are higher clocked, more power efficient, and maybe have more/faster cache, but should be fundamentally the same. Assuming this, one should be able to extrapolate synthetic benches based on scaling cores and frequency, no?
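The extrapolation suggested above is simple enough to sketch. Every number in this snippet is an illustrative placeholder, not a measured G3 result:

```python
# Rough per-core scaling estimate for a frequency-bumped PPC750 (G3),
# assuming the microarchitecture (and thus IPC) is unchanged.
# All numbers here are illustrative placeholders, not measurements.

def scale_score(base_score, base_mhz, target_mhz, base_cores=1, target_cores=1):
    """Linearly scale a synthetic benchmark score by clock and core count.

    This ignores memory-bandwidth limits, so it is an upper bound at best.
    """
    return base_score * (target_mhz / base_mhz) * (target_cores / base_cores)

# e.g. a hypothetical single-core G3 score at 450 MHz, projected to 1.24 GHz:
estimate = scale_score(base_score=100.0, base_mhz=450, target_mhz=1240)
print(round(estimate, 1))  # 275.6
```

Linear scaling like this flatters the projection, since cache and memory latency do not speed up with the core clock.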
DanNeely - Sunday, June 2, 2013 - link
Where does Atom stand in the mix? I think it would be a useful datapoint since Intel is positioning the Atom against ARM-based systems.
Wilco1 - Monday, June 3, 2013 - link
IIRC Atom has similar peak FP capabilities as Cortex-A9, however actual performance is far lower. Eg. 1.4GHz Cortex-A9 wins most single threaded FP benchmarks against a 2GHz Z2480: http://browser.primatelabs.com/geekbench2/compare/...
This also shows how far behind Atom is compared with last-generation phones. Intel needs Silvermont desperately to try to close the gap.
watersb - Sunday, June 2, 2013 - link
Excellent work! I wonder if GPU-based floating point will see more rapid adoption in the mobile space.
oc3an - Sunday, June 2, 2013 - link
How did you account for time spent not running your benchmark, i.e. when the OS is servicing interrupts or switched to a different task?
codedivine - Sunday, June 2, 2013 - link
Well it is difficult to measure them. But I do not think those were significant issues in this test.
phoenix_rizzen - Sunday, June 2, 2013 - link
If you are running your app via Android, consider installing Diagnosis Pro. It will allow you to add an overlay that shows you the exact frequency of each individual core, as polled every X seconds. Alternatively, it can just log the data to its internal database for later export.
Works quite nicely on an Optimus G (quad-core Snapdragon S4 Pro SoC).
I've been using it to test how well different CPU governors and hot-plug CPU drivers work.
codedivine - Sunday, June 2, 2013 - link
Thanks for the tip! I will look into it!
ChronoReverse - Tuesday, June 4, 2013 - link
Yeah, the thermal throttling on the Krait devices is very aggressive (I'm currently using a hack on my GS4 to stop it because I like benchmarks).
Overhead on Android is also pretty high. With your RgbenchMM, the difference on my GS4 is 3000 vs 3400 if I go ahead and kill tasks first.
whyso - Sunday, June 2, 2013 - link
Would really be nice to see:
1) Jaguar
2) Atom
3) Ivy Bridge
in the mix. (Though of course the test would have to be coded differently).
codedivine - Sunday, June 2, 2013 - link
One needs to be careful while comparing instruction throughput across ISAs, because instructions on different ISAs are not equivalent. However, certainly looking into it.
Marat Dukhan - Sunday, June 2, 2013 - link
This is not an artificial benchmark, but it gets close to peak: https://twitter.com/YepppLibrary/status/3414033653...
nakul02 - Sunday, June 2, 2013 - link
Check out this paper:
Gaurav Mitra, Beau Johnston, Alistair P. Rendell, Eric McCreath, Jun Zhou. Use of SIMD Vector Operations to Accelerate Application Code Performance on Low-Powered ARM and Intel Platforms, Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2013 IEEE 27th International. IEEE, 2013.
ZeDestructor - Monday, June 3, 2013 - link
So they try to (IMO), but there's only so many architecture launches every generation so you kinda have to do the more consumer-focused stuff to fill in the gap.
skiboysteve - Monday, June 3, 2013 - link
My work is going to be using Cortex-A9 for a project soon and that team is deciding on NEON vs VFPv3. Can you comment on the precision and performance tradeoffs?
Thanks for the great article!
Wilco1 - Monday, June 3, 2013 - link
Neon supports 32-bit float only, but with Neon A9 can do 2 FMACs per cycle rather than 1 with VFP. There is no tradeoff in precision if your code already uses 32-bit floats (Neon flushes denormals to zero by default, with VFP you can choose - either way, it doesn't affect any real code).
eiriklf - Monday, June 3, 2013 - link
Is there any chance to see the scores from a third Krait 200 device, for instance a Krait-based One X, GS3 or Optimus G? I know all of those devices have about 3x the performance of the Nexus 4 in Linpack per core, so I would love to know if you found a difference with your script.
srihari - Monday, June 3, 2013 - link
Can you compare with Intel? I understand you have NEON instructions in your test, but x86 vs ARM would be a good comparison.
srihari - Monday, June 3, 2013 - link
Performance is not the only criterion to compare. I would conclude Krait 300 clearly leads considering performance+power.
banvetor - Wednesday, June 5, 2013 - link
Great article, thanks for the work. Looking forward to more in the series... :)
Parhelion69 - Wednesday, June 5, 2013 - link
Could you update this article with numbers from the Exynos 5 Octa, from the SGS IV?
I've run some benchmarks and its A15 seems like quite a beast:
AnTuTu: 28086, CPU float-point: 5923
JavaScript:
SunSpider: 652 ms
Kraken: 6392 ms
Riabench focus: 1468 ms
I don't have Geekbench but found these numbers:
http://browser.primatelabs.com/geekbench2/2014946
Geekbench score: 3598, floating point: 6168
Arkantus - Wednesday, June 19, 2013 - link
Hello, just a dumb question: the article says "I count NEON addition and multiply as four flops and NEON MACs are counted as eight flops.", and the A9 Add(fp32 NEON) is rated at 1/2 flop/cycle.
So does this mean that the Add(fp32 NEON) is slower than its VFP counterpart, since for each cycle the NEON version only performs half an operation according to this table?
Thanks
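Not an authoritative answer, but the counting convention quoted in the question can be turned into quick arithmetic. This only shows what the table entry would imply, not whether the entry itself is correct:

```python
# Flop-counting convention quoted from the article:
# a NEON fp32 addition counts as 4 flops, a NEON MAC as 8 flops.
NEON_ADD_FLOPS = 4.0

# If a table reports R flops/cycle for an instruction worth F flops,
# the implied issue rate is R / F instructions per cycle.
def implied_issue_rate(flops_per_cycle, flops_per_instruction):
    return flops_per_cycle / flops_per_instruction

# Under that reading, the A9 Add(fp32 NEON) entry of 1/2 flop/cycle
# would mean one NEON add completing every 8 cycles on average:
rate = implied_issue_rate(0.5, NEON_ADD_FLOPS)
print(rate)      # 0.125 instructions per cycle
print(1 / rate)  # 8.0 cycles per instruction
```

By the same reading, a scalar VFP add converts with F = 1 flop per instruction, which is what makes the two columns directly comparable.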
[email protected] - Friday, June 27, 2014 - link
Hey this is good stuff. Can anybody here help explain something for me though?
I'm a database apps and integration guy, not formally trained and just starting to get interested in this kind of low level stuff. I've just been reading up on DMips and wondering how they relate to flops.
What I think I know so far:
A flop is a floating point calculation.
The "ip" in "Mip" is an instruction, so a broader term (is a flop a type of ip, or does it take 2 ips to make a flop drop?)
Instructions per second is about the rawest, most non-contextualised metric of computing power you can get. Flops are a close second.
Squeezing more instructions out of a single CPU cycle is the hard problem. There aren't massive variances in what can be done in this regard. The Krait 300 manages about 3.3 instructions per cycle, which on 4 cores at 1.7GHz works out to about 22 GigaIPS (semi-source: http://investorshub.advfn.com/boards/read_msg.aspx...
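The arithmetic behind that 22 GigaIPS figure can be checked directly, taking the quoted IPC and clock at face value:

```python
# Back-of-envelope check of the figure above:
# instructions/cycle x clock (Hz) x cores.
ipc = 3.3        # claimed instructions per cycle for Krait 300
clock_hz = 1.7e9 # 1.7 GHz
cores = 4

giga_ips = ipc * clock_hz * cores / 1e9
print(round(giga_ips, 2))  # 22.44 -> "about 22 GigaIPS"
```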
My question is, firstly why are GPUs seemingly never measured in DMips and CPUs rarely in flops?
Secondly, would knowing the answer to the "firstly" explain why, despite no huge variance in DMips/MHz across different devices, the top GPUs manage 1000x faster performance measured in flops than these ARM chips? They get tera- not gigaflops whilst using a similar number of cores and a lower frequency.
Obviously they consume a tonne more power to do it, so I know it's not something for nothing, but what's the heart of that something when it comes to how much you can do in a cycle?
Ah. It's just occurred to me. Is it that an "instruction" refers to an item in a linear thread, but that just 1 of them in a GPU might include setting the RGB values for all pixels in a frame all at once? That'd be a few million flops in parallel for one instruction?
Hmmm, well the real-world numbers don't add up for that, but is that along the right lines? If so, why are these Gigaflop numbers lower than Gigamips?
Sorry it was a long one. It can be very hard to find an intermediate starting point when you google an advanced subject.
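The intuition in the last questions is essentially SIMD/SIMT: one vector instruction performs the same operation across many data lanes at once. A toy sketch (the lane counts are illustrative, not from any real GPU):

```python
# Toy model of why GPU "flops" can dwarf CPU "instructions per second":
# one vector (SIMD/SIMT) instruction applies the same operation to many
# data lanes at once, so flops per instruction can be large.

def flops_per_instruction(lanes, flops_per_lane=2):
    # flops_per_lane=2 models a fused multiply-add (one mul + one add)
    return lanes * flops_per_lane

# A scalar CPU instruction: 1 lane, 1 op -> 1 flop
print(flops_per_instruction(lanes=1, flops_per_lane=1))  # 1

# A wide GPU-style instruction over, say, 64 lanes of FMAs -> 128 flops
print(flops_per_instruction(lanes=64))                   # 128
```

So flops and instructions per second measure different things: an instruction count ignores how much arithmetic each instruction carries, which is exactly where wide-vector GPUs pull ahead.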