SarahKerrigan - Tuesday, March 10, 2020 - link
That single-thread performance is extremely impressive. The multithreaded scaling is ugly, though. Back when N1 was announced, ARM seemed to think 1MB/core was a good spot for Neoverse LLC - I wonder why both Graviton and Altra are going for considerably less.
shing3232 - Tuesday, March 10, 2020 - link
it's gonna be costly (die- and power-wise) to build an interconnect for 64C with good performance. By that time, it would have lost its power/perf edge, I suppose.
Tabalan - Tuesday, March 10, 2020 - link
Scaling might not be optimal, but performance losses are to be expected if you greatly reduce available cache. In the end, MT performance is still far ahead of the competition.
ballsystemlord - Thursday, March 12, 2020 - link
You have to remember that the competition is not 64 cores, but 64 vCPUs. The difference is 60% or more. The Arm Graviton2 is being placed into the best possible light by this comparison.
ballsystemlord - Thursday, March 12, 2020 - link
I mean 60% for the cores that are actually 1 thread. As in, the performance boost by turning on SMT is 40% best case scenario.
autarchprinceps - Sunday, October 25, 2020 - link
I have to disagree. You seem to forget that the Arm chip is cheaper. It’s an additional win if it manages to integrate more cores and yet still achieve comparable single-threaded performance. It’s not unfair to compare two products with one seeming to have a stat advantage from the start, if it’s still cheaper or costs the same. Why should a customer care?
zamroni - Thursday, March 12, 2020 - link
L caches use SRAM, which needs 6 transistors per bit. So every 1MB needs at least 48 million transistors, without counting transistors for the controller.
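As a back-of-the-envelope check of that arithmetic (a sketch only; the 6T/bit figure ignores the tag arrays, ECC, and the controller logic mentioned above, and uses decimal megabytes):

```c
#include <stdio.h>

int main(void) {
    // Classic 6T SRAM cell: six transistors per stored bit.
    const long long bits_per_mb = 1000000LL * 8;   // 1 MB (decimal) in bits
    const long long per_mb = bits_per_mb * 6;      // 48 million transistors

    printf("1 MB of 6T SRAM:  %lld million transistors\n", per_mb / 1000000);
    // Graviton2's 32 MB LLC, data arrays alone:
    printf("32 MB of 6T SRAM: %lld million transistors\n", 32 * per_mb / 1000000);
    return 0;
}
```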
RallJ - Tuesday, March 10, 2020 - link
The comparisons made are of the whole-core performance of Graviton against just the per-thread performance of Xeon/EPYC. It's very problematic.
Also, the TDP rating for the Graviton is off by 50% based on what was reported at re:Invent.
Andrei Frumusanu - Tuesday, March 10, 2020 - link
I go over the core/SMT topic in the article; it's only a problem from a hardware comparison aspect, but it's very much the correct comparison from a cloud product offering perspective. The value proposition also does not change depending on core count; the instances are priced at similar tiers.
eek2121 - Tuesday, March 10, 2020 - link
It is worth noting AnandTech’s own numbers: https://www.anandtech.com/show/14694/amd-rome-epyc...
RallJ - Tuesday, March 10, 2020 - link
I understand that, but considering everything boils down to just $/vCPU/hr, I think a discussion around the new Xeon Gold R is warranted. For example, the existing dual-socket Xeon Amazon is using can be substituted by the new 6248R for a 60% lower price, while providing a modest turbo and base frequency improvement and a slight TDP reduction versus the existing Platinum they have. Unless Amazon decides to pocket the savings, that would have a massive impact on the vCPU $ comparison.
https://www.anandtech.com/show/15542/intel-updates...
Andrei Frumusanu - Tuesday, March 10, 2020 - link
Hyperscalers never pay full list price for their special SKUs, so comparisons to public new SKUs like the 6248R are not relevant.
We're happy to update the landscape once EC2 introduces newer generation instances, but for now, these are the current prices and costs for what's available today and in the next few months.
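For reference, the $/vCPU/hr arithmetic this sub-thread keeps coming back to is simple to reproduce; a sketch using the on-demand prices quoted elsewhere in these comments (illustrative figures, not a statement of current AWS pricing):

```c
#include <stdio.h>

int main(void) {
    // On-demand prices as quoted in this comment thread (USD/hour);
    // treat them as illustrative rather than authoritative.
    const double m5n_16xlarge = 3.808;
    const double m5_16xlarge  = 3.07;
    const int    vcpus        = 64;   // 32 physical cores with SMT

    printf("m5n.16xlarge: $%.4f per vCPU-hour\n", m5n_16xlarge / vcpus);
    printf("m5.16xlarge:  $%.4f per vCPU-hour\n", m5_16xlarge / vcpus);
    return 0;
}
```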
Spunjji - Wednesday, March 11, 2020 - link
I'm confused. Either you can think that everything boils down to $/vCPU/hr, in which case the only thing that's relevant is what Amazon actually offer, or you can think that "a discussion around the 'new' Xeon Gold R is warranted". They're mutually exclusive.
close - Tuesday, March 10, 2020 - link
Great write-up Andrei. One question (I hope I didn't miss the answer in the article): does Amazon's chip come out in front in the cost analysis because Amazon decided to take a loss or overcharge the other options, or is it an organic difference where it's intrinsically better?
Andrei Frumusanu - Tuesday, March 10, 2020 - link
We have no idea of Amazon's internal cost structure, so take the cost analysis from an end-user TCO perspective.
eek2121 - Tuesday, March 10, 2020 - link
I suspect the TDP of this chip is likely in the 150 watt range. We also know nothing about the operating environment of any of the chips. For example, the chip is rated for DDR4 3200, but is it running at 3200 speeds? The EPYC chip likely is NOT. So many questions here...
Andrei Frumusanu - Tuesday, March 10, 2020 - link
It is running 3200, Amazon confirmed that.
They didn't comment on TDP, but given Arm and Ampere's figures, I think my estimate is correct.
Flunk - Thursday, April 9, 2020 - link
They're comparing VMs with the same cost/hour. What the number of cores/threads is isn't really relevant.
autarchprinceps - Sunday, October 25, 2020 - link
That’s exactly why they reserved the entire hardware. If you run only a single workload on SMT, that single thread can use the entire core. That’s kind of the point of SMT.
notladca - Tuesday, March 10, 2020 - link
I would love to know if the product line has split within Annapurna - in other words, whether Graviton2 has, like previous Annapurna SoCs, some interesting support around storage and networking for use in future Nitro. It's possible Amazon has some behind-the-scenes work going on with CCIX for future machines, for example integrating their Inferentia chip more closely with the SoC.
Given the core count, it'd also be interesting to compare ML inference acceleration via fp16 and int8 dot-product instructions per core vs the use of a GPU or Inferentia.
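On the int8 point: the N1 cores in Graviton2 implement the Armv8.2 SDOT/UDOT dot-product instructions, which is what inference kernels would lean on. A minimal sketch with ACLE intrinsics (the function is hypothetical; build with something like -march=armv8.2-a+dotprod, and tail elements are ignored for brevity):

```c
#include <arm_neon.h>
#include <stdint.h>

// int8 dot product, 16 lanes at a time; each SDOT does 4 groups of
// 4 int8 multiply-accumulates into the 4 int32 accumulator lanes.
int32_t dot_s8(const int8_t *a, const int8_t *b, int n) {
    int32x4_t acc = vdupq_n_s32(0);
    for (int i = 0; i + 16 <= n; i += 16) {
        int8x16_t va = vld1q_s8(a + i);
        int8x16_t vb = vld1q_s8(b + i);
        acc = vdotq_s32(acc, va, vb);
    }
    return vaddvq_s32(acc);   // horizontal sum of the 4 accumulators
}
```

The fp16 side would go through the Armv8.2 half-precision FMLA path in the same style.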
coder543 - Tuesday, March 10, 2020 - link
One small bit of feedback: with that CPU topology chart, the coloration seems a little off. A difference of +/- 1 yields very different shades of red and orange, but the same difference on the green side of the spectrum yields no discernible difference in color? Personally, I think all of the 200 +/- 5 values in the first topology chart should be an almost uniform sea of orange/red. The important thing is the 150 difference in latency, not the +/- 1 latency, and the noise in the colors distracts the reader from the primary distinction. A lower signal-to-noise ratio.
Also: what is the unit? Nanoseconds? Microseconds? Milliseconds? I can’t figure it out, and it’s not labeled as far as I can tell.
Andrei Frumusanu - Tuesday, March 10, 2020 - link
Andrei Frumusanu - Tuesday, March 10, 2020 - link
Nanoseconds, I'll add a remark.
sing_electric - Tuesday, March 10, 2020 - link
My tin hat is telling me to be suspicious of Amazon's pricing here. When shopping for cloud computing, perf/$ becomes VERY alluring, but I have to wonder if Amazon is willing to let its Graviton servers be a "loss leader," artificially lowering prices to gain market share until Arm on server is well established - before then raising prices to something closer to an economically sustainable number.
Andrei Frumusanu - Tuesday, March 10, 2020 - link
Vertical integration is powerful. Amazon can share profits and margins division-wide, not having to pay overhead to AMD/Intel.
sing_electric - Tuesday, March 10, 2020 - link
True, but then Amazon has to pay for the ARM license and 100% of the development/production costs. I would be very surprised if they managed to *make money* on the 1st couple Graviton generations (especially if you factor in having to buy Annapurna), since you'd need to say "of the $X generated by Graviton metal, $Y would have been spent on EC2 anyways, meaning $Z is our actual gain," and that's... probably too much to ask at this stage.
rahvin - Tuesday, March 10, 2020 - link
The costs you mention are nothing compared to what they pay right now to Intel or AMD, with their 50% margins on top of the actual cost. IMO this initiative was born out of Intel's price increases from 2010 to now. By vertically integrating they have full control over the price structure, and they have very good data on what kinds of workloads are running, so they can tailor the design.
IMO it was just a question of time until Amazon tried to vertically integrate this like they've done with shipping and lots of other stuff. Bezos is following the Robber Baron growth model.
dotjaz - Wednesday, March 11, 2020 - link
Huh? AMD has a gross margin of 40%, true. But keep in mind AWS has an operating margin of 30%, which means AWS has an even higher gross margin than AMD, comparable to AMD's server department.
Do you know what that means? For $1 of expenditure into chip manufacturing, AWS expects to earn as much as AMD does. And since AWS doesn't have the volume as far as chips go, their gross margin on chip investment will be lower, and therefore not worth the investment if the decision is purely financial.
But yes, the other point stands: AWS has better control of costing (with more leverage as well) and performance.
Wilco1 - Wednesday, March 11, 2020 - link
For every $1 worth of silicon you could pay AMD $1.50, pay Intel $2, or pay TSMC $1 plus $0.20 internal development costs. Which works out best, you think?
extide - Friday, March 13, 2020 - link
It's not that simple. AMD and Intel can spread those development costs over vastly more processors. I mean, we'll never know how it truly breaks down -- but I'd imagine Amazon has figured this all out and this will be pretty profitable for them.
Wilco1 - Friday, March 13, 2020 - link
Developing a chip based on a standard Arm core is much cheaper. Arm chip volumes are much higher than Intel's and AMD's; the costs are spread out over billions of chips.
ksec - Tuesday, March 10, 2020 - link
ARM's licensing, comparatively speaking, is extremely cheap, even for their most expensive N1 core blueprint. The development cost is largely on ARM's side because of the platform model, so Amazon is only really paying for the cost to fab with TSMC. I would be surprised if those chips cost more than $300, which is at least a few thousand less than Intel or even AMD.
Amazon will have to pay for all the software costs, though: making sure all their tools and software run on ARM. That is very expensive in engineering cost, but pays off in the long term.
extide - Friday, March 13, 2020 - link
Actual production cost is going to be more like $50 or so. WAY less than $300.
ksec - Monday, March 30, 2020 - link
The wafer cost alone would be $50+, assuming 100% yield. That is excluding licensing and additional R&D. At their volume I would not be surprised if it stacks up to $300.
FunBunny2 - Tuesday, March 10, 2020 - link
"Vertical integration is powerful."I find it amusing that compute folks are reinventing the wheel from Henry Ford!! River Rouge.
mrvco - Tuesday, March 10, 2020 - link
It would be interesting to see how the AWS instances compare to performance-competitive Azure instances on a value basis.
kliend - Tuesday, March 10, 2020 - link
Anecdotally, yes. Amazon is always trying to bring in users for little/no immediate profit.
skaurus - Tuesday, March 10, 2020 - link
At scale, predictability is more important in infrastructure than cost. It may seem that if we have everything we need compiled for Arm, we can just switch over. But these things often look easier in theory than practice. I'd be wary of moving an existing service to Arm instances, or even starting a new one, when I just want to iterate fast and be sure that the underlying level doesn't have any new surprises.
It will be fine if I have time to experiment, or later, when the dust settles. Right now, I doubt that switching over to these instances once they are available is actually an easy or even smart decision.
FunBunny2 - Tuesday, March 10, 2020 - link
"It may seem that if we have everything we need compiled for Arm, we can just switch over. But these things often look easier in theory than practice. "with language compliant compilers, I don't buy that argument. it can certainly be true that RISC-ier processors yield larger binaries and slower performance, but real application failure has to be due to OS mismatches. C is the universal assembler.
mm0zct - Wednesday, March 11, 2020 - link
Beware that in C, struct packing is ABI-dependent: if you write out a struct to disk on x86_64 and try to read it back in on AArch64, you might have a bad time unless you use the packed pragma and specified-width types. This is the sort of thing that might get you if you try to migrate between architectures.
Also, many languages (including C) have hand-optimised math libraries with inline assembler, which might still be using plain-C fallbacks on other architectures. There was a good article discussing the migration to AArch64 at Cloudflare; they particularly encountered issues with Go not being optimised on AArch64 yet: https://blog.cloudflare.com/arm-takes-wing/
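To make the struct-packing point concrete, a sketch of the defensive pattern (the record layout is invented for illustration; note that x86_64 and AArch64 Linux are both little-endian, so the usual trap is padding rather than byte order):

```c
#include <stdint.h>

// Portable on-disk record: fixed-width types plus explicit packing.
// Without the packed attribute the compiler may insert padding, and
// the amount of padding is an ABI decision that can differ between
// architectures, so raw bytes written on one machine may not parse
// on another.
struct __attribute__((packed)) record {
    uint32_t id;
    uint16_t flags;
    uint8_t  kind;
    int64_t  timestamp;
};

_Static_assert(sizeof(struct record) == 15,
               "on-disk layout must be the same on every architecture");
```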
Wilco1 - Wednesday, March 11, 2020 - link
It's funny you mention optimized math libraries. The reality is that Arm has freely published generic C math libraries which beat handwritten x86 assembler implementations: https://github.com/ARM-software/optimized-routines
The glibc version installed on Graviton2 is relatively old, so it doesn't have this new math code yet (while the Android and LLVM libraries do), and this explains why the GCC SPECFP scores are relatively low.
senttoschool - Tuesday, March 10, 2020 - link
Can we conclude that ARM is going to destroy AMD and Intel in the server space within the next 5 years?
RSAUser - Tuesday, March 10, 2020 - link
No, but they're going to reduce the excessive margins.
rogerdpack - Monday, February 14, 2022 - link
Wish they'd release it to more than just datacenters though...
jeffsci - Tuesday, March 10, 2020 - link
"I didn’t have a proper good multi-core bandwidth test available in my toolset (going to have to write one), so fell back to Timo Bingmann’s PMBW test for some quick numbers on the memory bandwidth scaling of the Graviton2."The canonical benchmark for memory bandwidth, which supports OpenMP for multithreading, is McCalpin's STREAM (https://www.cs.virginia.edu/stream/).
Andrei Frumusanu - Tuesday, March 10, 2020 - link
I'm not a big fan of it, particularly because of OMP; one can do much better.
kliend - Tuesday, March 10, 2020 - link
I have a question I did not find addressed in the article.
Will Amazon/AWS offer this instance in Linux only, or do they also run Windows?
Andrei Frumusanu - Tuesday, March 10, 2020 - link
The preview images are all Linux; I'm not aware of their plans on Windows.
Korguz's Mom - Tuesday, March 10, 2020 - link
Probably not - if you need a Windows image I would imagine they would push you towards the Intel or AMD service and not the ARM service - yes Windows Server runs on ARM, but unless you were testing Windows applications / services specifically for ARM, there would be no benefit.
Korguz - Wednesday, March 11, 2020 - link
FYI, my mom died of cancer 4 years ago. I hope you are happy and proud of yourself. You are scum.
anonomouse - Tuesday, March 10, 2020 - link
Will there be more articles on this, covering other workloads than SPEC? You see lots of academic and industry papers talking about how real cloud/hyperscaler/server workloads have deep software stacks with large instruction-side footprints and static branch footprints, whereas SPEC is really... not that. Those workloads tend to have lower IPC on all platforms, and it would be interesting to see how Graviton2 performs on those from the instruction-supply side of things (1 core) as well as how I-side bandwidth scales horizontally with thread counts given the coherent I-Cache.
Andrei Frumusanu - Tuesday, March 10, 2020 - link
Concrete suggestions in terms of workloads to look at that can be reasonably deployed are welcome - we currently don't have a well-defined test suite for such things.
FunBunny2 - Tuesday, March 10, 2020 - link
"Concrete suggestions in terms of workloads"OLTP on RDBMS?? real one, of course, not MySql. :)
Andrei Frumusanu - Tuesday, March 10, 2020 - link
I mean an actual concrete example of such a structured benchmark; me going around doing random DB operations just opens up more criticism on why we didn't use test framework XYZ.
FunBunny2 - Tuesday, March 10, 2020 - link
Here's one: https://hammerdb.com/ - I don't know (though it's perhaps likely) whether you can get the source and compile it for any DB/OS of interest. Didn't say it was simple. :)
Andrei Frumusanu - Wednesday, March 11, 2020 - link
It's just that I'm hearing a lot of "we want something specific" without actually specifying anything; me doing some random workload myself that isn't validated in terms of characterisation isn't, in my view, any better than the well-understood nature of SPEC.
anonomouse - Wednesday, March 11, 2020 - link
Have you looked at the benchmarks in GCP PerfKitBenchmarker (https://github.com/GoogleCloudPlatform/PerfKitBenc...)? It includes versions of various popular benchmarks, including variants of YCSB on different databases, OLTP, CloudSuite, and Hadoop, plus a bunch of wrapper infrastructure around running the tests on cloud providers.
anonomouse - Wednesday, March 11, 2020 - link
Okay, so maybe the comment system doesn't deal well with links:
https://github.com/GoogleCloudPlatform/PerfKitBenc...
http://googlecloudplatform.github.io/PerfKitBenchm...
yeeeeman - Tuesday, March 10, 2020 - link
Ok, now imagine this chip with Apple custom cores. Even Zen wouldn't stand a chance.
HStewart - Tuesday, March 10, 2020 - link
You can't truly say that. Keep in mind both Apple and Amazon are aiming at their own custom environments - things are likely different in the real world.
Duncan Macdonald - Tuesday, March 10, 2020 - link
The Apple CPU cores are larger and more power-hungry when loaded hard than the CPU cores on the N1. A 64-CPU chip with the high-performance cores from the Apple A13 would consume far more power than the N1 and would be quite a bit larger. The Apple A13 chip (in the iPhone 11) is suited for intermittent load, not the sustained use that server-type chips such as the N1 have to deal with.
arashi - Wednesday, March 11, 2020 - link
Yikes
manedsib1 - Tuesday, March 10, 2020 - link
You are using an Epyc processor that is nearly 3 years old.
Surely you should use this year's model (or a 64-core Threadripper if you don't have one).
vanilla_gorilla - Wednesday, March 11, 2020 - link
You should consider reading the article, and then you would know exactly why they are using those CPUs.
Kamen Rider Blade - Tuesday, March 10, 2020 - link
The benchmarks feel incomplete. Why don't you have a 64-core Zen2-based processor in it to compare?
Even the ThreadRipper 64-core would be something.
But not having AMD's latest Server grade CPU in your benchmarks really feels like you're doing a disservice to your readers, especially since we've seen your previous reviews with the Zen 2 64 core monster.
Rudde - Wednesday, March 11, 2020 - link
Read the article! Rome is mentioned over five times. In short, Amazon doesn't offer Rome instances yet, and AnandTech will update this article once they do.
Sahrin - Tuesday, March 10, 2020 - link
I may be remembering incorrectly, but doesn't Gen 1 Epyc have the same cache tweaks as Zen+ (ie, Epyc 7001 series is based on Zen+, not Zen)?
Rudde - Wednesday, March 11, 2020 - link
They have the same optimisations as the first-gen Zen APUs, i.e. Ryzen Mobile 2xxx. Zen+ is a further developed architecture, albeit without further cache tweaks.
The cache tweaks in question were meant to be included in the original Zen, but didn't make it in time. As such, one could argue that first-gen Ryzen desktop is not full Zen (1), but a preview.
Sahrin - Tuesday, March 10, 2020 - link
The fact that Amazon refused to grant access to Rome-based instances tells you everything you need to know. Graviton competes with Zen and Xeon, but is absolutely smoked by Zen 2 in both absolute terms and perf/watt.
It's a shame to see Amazon hide behind marketing bullshit to make its products seem relevant.
rahvin - Thursday, March 12, 2020 - link
Don't be silly. Amazon buys processors in the thousands. There is no way AMD could have supplied enough Rome CPUs to Amazon to load up an instance at each of their locations in the time Rome has been for sale.
It typically takes about 6 months before Amazon gets instances online, because AMD/Intel aren't going to give Amazon the entire production run for the first 3 months. They've got about 20 data centers, and you'd probably need several hundred per data center to bring an instance up.
Consider the cost and scale of building that out before you criticize them for not having the latest and greatest released a month ago. Rome hasn't been available to actually purchase for very long, the cloud providers get special models, and AMD still needs to supply everyone else as well.
eastcoast_pete - Tuesday, March 10, 2020 - link
While I am currently not in the market for such cloud computing services aside from maybe some video processing, I for one welcome the arrival of a competitive non-x86 solution! Can only make life better and cheaper when and if I do. Also, ARM N1 arch lighting a fire under the x86 makers in their easy chairs will keep AMD and Intel on their feet, and that advance will filter down to my future desktops and laptops.
eastcoast_pete - Tuesday, March 10, 2020 - link
Thanks Andrei! Just out of curiosity, that "noisy neighbor" behavior you saw on the Xeon? I know it's mostly speculation, but would you expect this if someone is running AVX512 on neighboring cores? AVX512 is very powerful if applications can make use of it, but things get very toasty fast. Care to speculate?
willgart - Tuesday, March 10, 2020 - link
Where are the real-life benchmarks???
video encoding / decoding?
database performance?
web performance?
https encryption?
etc...
The_Assimilator - Thursday, March 12, 2020 - link
Agreed 100%. Without figures for actual real-world applications compiled with actual real-world compilers handling actual real-world workloads, this essentially amounts to an advertorial for Amazon, Graviton2 and Arm.
Danvelopment - Wednesday, March 11, 2020 - link
This may sound stupid, as I'm just getting into AWS as backup throughput for local servers on my web project that releases in April.
"If you’re an EC2 customer today, and unless you’re tied to x86 for whatever reason, you’d be stupid not to switch over to Graviton2 instances once they become available, as the cost savings will be significant."
How do you know whether what you're using is Intel, AMD or Graviton(1/2)? (I'm using T2s right now with no weighting, and if our release gets hit hard, will give it weight and increase its capacity.)
As they're not actually doing anything, I'd have no issue switching over, but I can't tell what I'm on.
CampGareth - Wednesday, March 11, 2020 - link
There's a list here: https://aws.amazon.com/ec2/instance-types/
If you're on T2 instances you're on Intel chips at the moment.
Quantumz0d - Wednesday, March 11, 2020 - link
No real benchmarks. Another SPEC whiteknighting. I see the AT forums Apple CPU thread getting creamed over this again.
ARM is a lockdown POS. You can't even buy them in this case. The Altra CPU didn't even come to STH for comparison, where it had so many cores against the x86 parts. You cannot get them running the majority of consumer workloads. One can claim POWER from IBM has SMT8 and was first to Gen4 and all, but if it's not consumer-centric it won't generate much profit.
The author seems to love ARM for some reason and hate x86. It's been like this since the Apple articles, but in real life we saw how the iPhone gets decimated in speed comparisons against Android flagships running the stone-age Qualcomm chips. We have seen this ARM-dethroning-x86 story numerous times, and it failed. I hope this also fails; a non-standard CPU takes all the fun out of the equation, and needs emulation for consumer use, which slows down performance.
People want to see all the workloads, not SPEC. Also, where is the EPYC Rome comparison? Nowhere. Soon Milan is going to hit. Glad that AMD is alive. This stupid ARM BGA dumpster should be dead in its infancy.
Wilco1 - Wednesday, March 11, 2020 - link
Wilco1 - Wednesday, March 11, 2020 - link
LOL - someone feels extremely threatened by Arm servers... Mission accomplished!
anonomouse - Wednesday, March 11, 2020 - link
Well that was bizarrely incoherent. What workloads would you want to see instead? Nothing else you wrote made any sense or had any facts behind it.
Andrei Frumusanu - Wednesday, March 11, 2020 - link
He's been doing it for the last year or two, ignore it.
jbrower - Saturday, July 24, 2021 - link
Well at least you have a troll -- mark of success for authors, hehe
ProDigit - Wednesday, March 11, 2020 - link
110W is very pessimistic and would make no sense at all, considering that the Ryzen 9 3900X uses 105W with 12 cores / 24 threads at 4.6GHz on 7nm, and the 3950X does the same with 4 more cores.
Plus, regular Arm-based (AMLogic) boxes use 3 watts in total under load (that includes CPU+Ethernet+RAM+eMMC) for 4 CPU cores running at 1.9GHz.
If you ask me, 64-core Arm CPUs running at 2GHz should run at around just over 1 watt per core, making it a 65W TDP chip.
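Spelling that estimate out, with the uncore left as the explicit unknown (all inputs are the guesses from the comment above, not measured figures):

```c
#include <stdio.h>

int main(void) {
    // Guesses from the comment above - not measured figures.
    const int    cores      = 64;
    const double w_per_core = 1.0;   // "just over 1 watt per core"

    printf("Cores alone: %.0f W\n", cores * w_per_core);
    // Whatever the mesh interconnect, 8 DDR4 channels, and 64 PCIe4
    // lanes draw comes on top of this, and is the open question.
    return 0;
}
```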
Andrei Frumusanu - Wednesday, March 11, 2020 - link
There are 64 PCIe4 lanes and 8 memory controllers in there as well.
cdome - Wednesday, March 11, 2020 - link
Quick question: does Graviton2 have support for the SVE2 vector extension? If yes, how wide are the execution units? Thank you.
Andrei Frumusanu - Wednesday, March 11, 2020 - link
No, there are 2x128b v8 ASIMD/NEON pipes.
Soulkeeper - Wednesday, March 11, 2020 - link
What was used to generate the images on page 2?
ie: https://images.anandtech.com/doci/15578/AMD-Epyc-6...
Is this app/source available to download?
Thanks
sharath.naik - Wednesday, March 11, 2020 - link
What's behind the name Annapurna? The name is Indian in origin, but the company is Israeli.
nijimon - Thursday, March 12, 2020 - link
Judging by the logo it could be referring to the massif in the Himalayas:
https://en.wikipedia.org/wiki/Annapurna_Massif
Andy Chow - Thursday, March 12, 2020 - link
"I recently had the time to write a new custom microbenchmark for testing synchronisation latencies of CPU cores, exhibiting some of the cache-coherency as well as physical layouts of current designs."Wow, and what a benchmark that turned out to be. Please consider packaging it and releasing it. Or giving us the code so we can run it. I would really love to run that test on a few of my machines. I am frustrated with current benchmarks on this area also, and you seem to have built the perfect solution.
ballsystemlord - Thursday, March 12, 2020 - link
1 grammar error:
"Overall, it's a bit odd to see GCC ahead in that many workloads given that LLVM the is the primary compiler for billions of Arm devices in the mobile space."
Extra "the":
"Overall, it's a bit odd to see GCC ahead in that many workloads given that LLVM is the primary compiler for billions of Arm devices in the mobile space."
imaskar - Friday, June 12, 2020 - link
There's a major flaw in the price comparison - why did they take the m5n (which has an additional network quota) instead of the regular m5? It would be $3.07 instead of $3.808.
BlueLikeYou - Tuesday, September 1, 2020 - link
Maybe I'm missing something, but the SPEC numbers seem a little low compared to published results. For example, an Intel Xeon Platinum 8260 scores around 280ish for 48 cores on SPEC INT RATE 2017. This chip is pretty similar to an 8259CL, except that the 8259CL has a slightly higher frequency at 2.5 GHz vs 2.4 GHz for the 8260.
The m5n.16xlarge has 32 cores. (32/48) * 280 = 186.67. Your result was 157.36; about 84% of my guess. Granted, performance will probably not scale exactly linearly and there may be a little virtualization overhead, but that drop still seems a little steep.
sgovindan - Friday, June 25, 2021 - link
Hi Andrei,
I'm trying to replicate your PMBW bandwidth numbers on AWS with a C6g instance, but I seem to be getting lower BW estimates - ~170 GB/s for scalar reads (64-bit) and ~160 GB/s for scalar writes with 64 threads. I've tried both 64 GB and 1 GB as the test sizes (the -s and -S parameters of PMBW). Could you confirm the test sizes and/or command lines used for your results? Thanks.