41 Comments
Raqia - Tuesday, December 3, 2019 - link
I assume those are multi-core figures being quoted against the M5?
Andrei Frumusanu - Tuesday, December 3, 2019 - link
Correct.
Raqia - Tuesday, December 3, 2019 - link
So they're comparing a 24-core x86 Xeon to a 64-core Neoverse implementation.
PeachNCream - Tuesday, December 3, 2019 - link
But it can tow the rear-drive-only Xeon uphill with an only slightly obvious rolling start, so it's clearly better, at least until Intel requests Amazon send a system over for an "apples-to-apples" comparison.
andrewaggb - Tuesday, December 3, 2019 - link
Unless I'm misunderstanding something, it sounds like it'll have worse perf/$ than EPYC and not be x64.
shompa - Tuesday, December 3, 2019 - link
Not being x64 is great. Why use fake 64-bit extensions that need a 32-bit CPU core to work when you can use real 64-bit, remove the whole 32-bit CPU block, and save energy and die space?
scineram - Wednesday, December 4, 2019 - link
No.
kallinteris - Wednesday, December 4, 2019 - link
What do you mean by "use fake 64-bit extensions that need a 32-bit CPU core to work when you can use real 64-bit"? All modern x86 programs are compiled for 64-bit anyway.
vanilla_gorilla - Tuesday, December 3, 2019 - link
And probably at lower cost and power usage.
SarahKerrigan - Tuesday, December 3, 2019 - link
How do you figure? It says "per vCPU." A vCPU is a single thread.
Raqia - Wednesday, December 4, 2019 - link
It looks like they're running the multi-threaded benchmark across all vCPUs and then dividing by the number of vCPUs, so it's a renormalized multicore figure (a quick sketch of that arithmetic follows the links below):
https://zdnet1.cbsistatic.com/hub/i/2019/11/30/3d9...
from:
https://www.zdnet.com/article/aws-graviton2-what-i...
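For anyone puzzling over how a "per vCPU" number falls out of a multi-threaded run, here is a minimal sketch of the renormalization described above. All scores and vCPU counts below are hypothetical placeholders, not AWS's published data.

```python
# Hedged sketch of the "renormalized multicore" figure: run the benchmark
# across every vCPU, then divide the aggregate score by the vCPU count.
# The numbers below are made up for illustration only.

def per_vcpu_score(aggregate_score: float, vcpu_count: int) -> float:
    """Aggregate multi-threaded score divided by the number of vCPUs."""
    return aggregate_score / vcpu_count

# Hypothetical instances: a 64-vCPU Graviton2 box vs. a 96-vCPU m5 box.
graviton2 = per_vcpu_score(aggregate_score=1280.0, vcpu_count=64)
m5_xeon = per_vcpu_score(aggregate_score=1330.0, vcpu_count=96)

print(f"Graviton2 per-vCPU advantage: {graviton2 / m5_xeon:.2f}x")
```

The point is that the divisor is vCPUs, not cores, so an SMT-enabled Xeon (two vCPUs per core) and a Graviton2 (one vCPU per core) are normalized by different things.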
ksec - Tuesday, December 3, 2019 - link
2x performance per core over A1. I think that is finally reaching desktop/server-class CPU performance, possibly 70% of Skylake.
This is going to kick-start ARM server usage. Would love to see some benchmarks on those.
webdoctors - Tuesday, December 3, 2019 - link
There's no kickstarting ARM server usage. Amazon is MASSIVE. Basically, if they move their hosted services to their own internal ARM servers, ARM server usage could go from 0 to 50% overnight. It's like the Costco analogy: whatever Amazon uses internally for storage, networking, and CPUs would become one of the largest in the world overnight.
I heard their internal network chip is something like the 3rd biggest in the world by volume after the big guys, because they use so much networking hardware internally. Even without external customers they've got the volume to skew the numbers.
R0H1T - Tuesday, December 3, 2019 - link
>ARM server usage could go from 0 to 50% overnight
You're clearly exaggerating. Pretty sure Google+FB have a bigger share of the market driving server sales in the enterprise arena, and they aren't 50% combined, so AWS can't possibly be more than 50% on their own. Heck, YouTube alone might be consuming more storage+chips (server) than AWS.
ksec - Wednesday, December 4, 2019 - link
Well, it won't be 50%, but I know it is big. I missed the announcement; I thought this was going to be like their A1 instances, but it turns out they intend to have all of their SaaS offerings (DB, SMS, mail, DNS, whatever it is) running on ARM.
But the hyperscalers together (that is, Alibaba, Amazon, Google, Facebook, etc.) own 50%+ of market shipments, and AWS was estimated at 50% of hyperscaler volume, so that is likely 20-25% of Intel's DC revenue vanishing.
ksec - Wednesday, December 4, 2019 - link
Also, they have (I don't know where it was posted) given out the single-core performance as 30% faster than a Skylake 3.1GHz thread.
That is very impressive.
Operandi - Tuesday, December 3, 2019 - link
30B transistors? Isn't the Xeon chip they are comparing it to like half that?
If that's the case, that doesn't look all that impressive at all.
blu42 - Tuesday, December 3, 2019 - link
Being stuck in 14nm land is far more impressive, I agree.
Operandi - Tuesday, December 3, 2019 - link
Nice one. Nothing at all to do with my statement, but also completely irrelevant to the story as a whole.
Wilco1 - Tuesday, December 3, 2019 - link
Yeah, so you need not 1 but 2 expensive 24-core Xeon chips to get similar performance. Not impressive at all...
mode_13h - Tuesday, December 3, 2019 - link
Sure, Amazon could compare to what they *estimate* a 10 nm Ice Lake server chip could deliver, but that just adds more variables into the mix. There's value in comparing to a known quantity (i.e. a current instance, whether their own or a competitive one).
Anyway, such apples-to-apples comparisons will certainly be made once both types of instances are actually available.
name99 - Tuesday, December 3, 2019 - link
It's not meant to be "impressive", it's meant to be informative.
Some of us can put the number in context; for everyone else it's irrelevant.
Spunjji - Wednesday, December 4, 2019 - link
Who cares how many transistors it uses if a comparable Intel product *doesn't exist*?
phoenix_rizzen - Thursday, December 12, 2019 - link
Xeon: 1.5 MB of L1, 24 MB of L2, and 33 MB of L3 cache.
Graviton: ? MB of L1, 64 MB of L2, 32 MB of L3 cache.
24 cores vs 64 cores.
6 memory controllers vs 8.
48 PCIe lanes vs 64.
And so on. There are other blocks in the CPU die as well that aren't compared here (media, AVX, etc.). It's not that hard to figure out why one has more transistors than another.
bryanlarsen - Tuesday, December 3, 2019 - link
A vCPU is half a core on Intel but a full core on Neoverse. So 40% faster per vCPU is actually 30% slower per core.
Wilco1 - Tuesday, December 3, 2019 - link
44% faster per vCPU means 95% per core, since Hyper-Threading gives about 30% on average.
SarahKerrigan - Tuesday, December 3, 2019 - link
In terms of throughput, the correct comparison point for one Neoverse vCPU is two Purley vCPUs, because AFAIK a vCPU is added per hard context on EC2. Based on that, a Purley core is still considerably higher throughput than an N1 core.
I suspect single-thread is close, or at least would be at base clocks; their mature turbo implementation continues to be a strong point for Intel. I also expect the Graviton2 chip's perf/W to be far better than the Xeon it is compared to.
Wilco1 - Tuesday, December 3, 2019 - link
Sure, but the Neoverse cores are much smaller, so you get 2.5 times the cores. If you're interested in throughput you need to compare total throughput per chip rather than per core. According to the SPECINT score, a 24-core Skylake-SP gets only half the throughput of one Graviton2 chip, so you need 2 of them.
The Platinum 8175 AWS m5 instances have a 3.1GHz all-core turbo (https://en.wikichip.org/wiki/intel/xeon_platinum/8...), so getting ~95% of the single-threaded performance of the Skylake at its max turbo is pretty impressive!
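To make the thread-vs-core arithmetic in this sub-thread explicit, here is a rough sketch using the figures quoted above. The ~44% per-vCPU advantage and the ~30% SMT uplift are assumptions taken from these comments, not measured data.

```python
# Back-of-envelope: convert a per-vCPU ratio into a per-core ratio.
# Assumptions (from this thread): SMT adds ~30% throughput to a Xeon core,
# and a Graviton2 vCPU scores ~1.44x a Xeon vCPU.
smt_uplift = 0.30
per_vcpu_ratio = 1.44

# With SMT on, a core delivers ~1.3x its single-thread throughput,
# so each of its two vCPUs is worth about 0.65 of a full core.
xeon_vcpu_as_core_fraction = (1 + smt_uplift) / 2

graviton_core_vs_xeon_core = per_vcpu_ratio * xeon_vcpu_as_core_fraction
print(f"Graviton2 core ~ {graviton_core_vs_xeon_core:.0%} of a Skylake core")  # ~94%
```

That is roughly where the "~95% per core" figure comes from; drop the SMT assumption and you land back at the more pessimistic per-core comparison earlier in the thread.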
Antony Newman - Tuesday, December 3, 2019 - link
Ultimate multicore performance for single-SoC x86 is being limited by dark silicon on Intel 14nm.
For a 64-core Intel monster, they need their (Intel) 7nm process, or a multi-SoC solution.
When TSMC's 5nm ovens are ready, Amazon will be able to use ARM's next cores, which will close the per-core performance gap while allowing considerably more cores before bottlenecking occurs.
A 128-core Arm Poseidon SoC on TSMC 5nm could very well eclipse a 64-core Intel CPU baked on Intel 7nm, but cost Amazon a fraction of the price.
AJ
mdriftmeyer - Wednesday, December 4, 2019 - link
When TSMC's 5nm is ready, AMD's future Zen cores will curb-stomp anything ARM can offer, like they already do.
Language is a funny thing: "New Generation of ARM-based instances powered by AWS Graviton2 processors offer 40% better price/performance than current x86-based instances."
A. That's 40% over previous Graviton processor nodes. BFD.
B. Our upcoming x86-based instances drastically kneecap our current x86-based instances in price/performance, but we won't say that, as we're trying to sell our own schtick here.
Gondalf - Friday, December 6, 2019 - link
TSMC 5nm does not give enough of an area advantage over 7nm to allow Poseidon. 5nm is more like a half node.
mode_13h - Tuesday, December 3, 2019 - link
Just to nitpick, Purley is Intel's LGA 3647-based platform spec, not the core uArch or anything like that.
techbug - Friday, December 6, 2019 - link
How the per-vCPU number is calculated is totally over my head. Is it the total score on the Intel processor divided by the number of hardware threads (96, 2 * 48 threads/socket), compared against the ARM processor score divided by (128, 2 * 64 threads/socket)?
name99 - Tuesday, December 3, 2019 - link
How many people buying AWS services care about latency rather than throughput?
Sure, you need to hit a minimum per-core performance level, but once that's achieved, what matters is throughput/dollar (including e.g. rack volume and watts).
Judging a design like this by metrics appropriate to the desktop is just silly.
ksec - Wednesday, December 4, 2019 - link
It doesn't matter: you get 1 thread per Intel vCPU and 1 core per ARM vCPU. The units are the same. Not to mention a lot of clients and workloads like to have HT disabled.
As long as the ARM vCPU is cheaper (which it is) and provides comparable performance (which it does; according to AWS it is 30% faster than a single 3.1GHz Skylake thread), then that is all that matters.
Sychonut - Tuesday, December 3, 2019 - link
Sychonut - Tuesday, December 3, 2019 - link
Now imagine this, but on 14++++++.
name99 - Tuesday, December 3, 2019 - link
The numbers seem a bit strange, Andrei. I assume we all agree that, while this is a nice step forward in the ARM server space, the individual cores are no Lightnings.
So let's look at area; TSMC 7nm, so basically like with like:
IF one chip has 32 cores (per yesterday's article), then one core (+support, i.e. L3 etc.) is ~10mm^2.
Meanwhile Apple is about 16mm^2 (eyeballing it as about 1/6th of the die for 2 large + small cores, + L2s + large system cache).
So Apple seems to be getting a LOT more out of their die... Even putting aside the small cores, their area per big core (+LOTS of cache) is ~8mm^2.
Of course DRAM PHYs take some space, but mainly around the edges.
So possibilities?
- 64 cores on the die, not 32? AND/OR
- LOTS of IO? A few Ethernet PHYs, some flash controllers, some USB and PCIe?
- Lots of the die devoted to GPU/NPU?
The only way I can square it is that likely all three are true. Half the die is IO+GPU/NPU (which gets us to 5mm^2/core) AND there are actually 64 cores? WikiChip says an N1+L2 is supposed to be around 1.4mm^2 on 7nm, so throw in L3 and the numbers kinda work out.
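A quick numeric restatement of the area argument above. The ~320 mm^2 die size is only what the "~10 mm^2 per core at 32 cores" estimate implies, and the 50/50 split between cores and IO/GPU/NPU is an assumption, not a disclosed figure.

```python
# Rough area budget implied by the comment above; every input here is an
# eyeballed assumption, not a published spec.
implied_die_area_mm2 = 32 * 10.0   # 32 cores at ~10 mm^2 each => ~320 mm^2
io_and_accel_fraction = 0.5        # assume half the die is IO + GPU/NPU
core_count = 64                    # if the part is actually 64-core

cpu_cluster_area = implied_die_area_mm2 * (1 - io_and_accel_fraction)
area_per_core = cpu_cluster_area / core_count
print(f"~{area_per_core:.1f} mm^2 per core incl. L3/uncore")  # ~2.5 mm^2
# WikiChip's ~1.4 mm^2 for an N1 core + L2 leaves ~1.1 mm^2 for L3 and mesh,
# which is roughly how "the numbers kinda work out".
```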
ksec - Wednesday, December 4, 2019 - link
They are 32-core, not 64.
I/O takes up more space and does not scale well with node changes. Yes, there are a lot of I/O needs for a server, especially PCIe lanes.
Wilco1 - Wednesday, December 4, 2019 - link
The chip has 64 cores, 8 DDR interfaces, and 64 PCIe lanes.
I can't see the confusion about core count: a 48-core Centriq has 18 billion transistors, and this has 30 billion for 64 cores.
name99 - Wednesday, December 4, 2019 - link
name99 - Wednesday, December 4, 2019 - link
The "confusion" is that this article
https://www.anandtech.com/show/15181/aws-designs-3...
claimed 32 cores.
And it's not a confusion, it's an attempt to confirm various points that would appear to be obvious (the number of cores, the amount of IO, AND --- you left this out --- the amount of non-CPU logic [GPU or NPU]) but which were omitted by this article or, apparently, simply incorrect in an earlier article.
SanX - Thursday, December 5, 2019 - link
Would be nice if these chips had FP units and AVX to really compete with Intel and AMD at supercomputer level.