
  • WinterCharm - Wednesday, February 20, 2019 - link

    There's a gigantic Arm vs x86/64 battle brewing for the entire computer industry. ARM is just more efficient at every level, and if software is properly optimized it performs brilliantly.
  • eva02langley - Wednesday, February 20, 2019 - link

    However, it doesn't have the raw power required for many fields like scientific, compute and research workloads. Core count is also going to be a huge factor in the future, and unless you develop a chiplet approach, ARM is going to face the same issues as monolithic chips.

    The next chiplet evolution will require stacking. The future is way more related to modularity than the chip architecture. Don't get me wrong, the more advancement, the better for everyone, but I don't believe ARM is going to render x86 obsolete. However, I believe multi-chip SoCs are going to render monolithic chips obsolete in the computer world.
  • SarahKerrigan - Wednesday, February 20, 2019 - link

    Sure it does. There are ARM supercomputers, and this very article shows an N1 core outperforming Zen on single-thread, and both Zen and SKL-SP on throughput.
  • HStewart - Wednesday, February 20, 2019 - link

    I think you are forgetting the very nature of RISC (Arm) vs CISC (x86) architectures. By the nature of the RISC design - a reduced instruction set - it takes more instructions to execute the same operation than on CISC. For simple stuff RISC can likely do better, but remember that modern x86-based CPUs also break down more complex instructions into simpler instructions so they can run on multiple pipelines.
  • SarahKerrigan - Wednesday, February 20, 2019 - link

    Dude, I work in the semi industry, and I've designed pipelined cores. Saying "ARM's workload-demonstrated higher performance doesn't matter because x86 is CISC" is idiotic.

    SPEC isn't "simple stuff." It is a selection of extremely compute-intensive workstation loads, one that the whole industry - including Intel - uses to demonstrate comparative performance.
  • HStewart - Wednesday, February 20, 2019 - link

    The biggest thing I found that seems like misinformation is the statement that these are estimates and that this chip is simulated, which tells me they don't have the real numbers.

    All I am saying is that CISC instructions can do more than RISC instructions per instruction, and it depends on the compiler to take advantage of those instructions. Please note I never stated it does not matter - those were your words. I am just mentioning considerations that need to be taken into account for the different architectures, and the fact that they are comparing a future simulated design to last year's designs.
  • Andrei Frumusanu - Wednesday, February 20, 2019 - link

    > All I am saying is that CISC instructions can do more than RISC instructions per instruction

    Nobody cares. If the performance per clock is the same or higher, you're just arguing about semantics.

    Internally, CISC processors break things down into RISC-like µOps anyway.
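
    A rough sketch of what that cracking looks like in practice - the assembly in the comments is only what a typical compiler might emit at -O2, so treat it as illustrative:

        #include <stdint.h>

        void bump(int32_t *counts, long i) {
            counts[i] += 1;
            /* x86-64 can encode the whole read-modify-write as one instruction,
                   add dword ptr [rdi + rsi*4], 1
               which the core then cracks into load / add / store micro-ops.
               An AArch64 compiler emits the same three steps as separate
               instructions (ldr / add / str), so once both are in the pipeline
               the work per clock looks very similar. */
        }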
  • ZolaIII - Wednesday, February 20, 2019 - link

    @Andrei Frumusanu what would be the estimated size of an A55 core with a similar amount of cache as on the presented E1, on 7nm lithography? I am very curious about that one. A comparison to the A72 & A73 would also be a good thing, as ARM claims it reaches their level of performance. It's a very interesting first-born (SMT) and a much needed one.
  • zmatt - Wednesday, February 20, 2019 - link

    When people talk about complex instructions they don't mean something like find the derivative of x^2. They mean something like a conditional move operation. The speed advantages on paper between RISC and CISC are in theory a wash. This is because while CISC can conceivably do more in an instruction, RISC can do more instructions per clock generally. In the real world the simplicity of RISC means usually, all other things being equal, the chips are simpler and can run higher clocks, draw less power and generate less heat for a given level of performance.
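
    A minimal example of the kind of instruction being discussed - the cmov/csel mapping mentioned below is typical compiler behaviour at -O2, not a guarantee:

        #include <stdint.h>

        /* Branchless select: on x86 this usually becomes a cmov, on AArch64
           a csel - one instruction either way, "CISC" or "RISC". */
        int32_t clamp_low(int32_t x, int32_t lo) {
            return x < lo ? lo : x;
        }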

    x86 chips haven't actually been CISC since the mid 90's. Both Intel and AMD have been making chips that take the CISC instructions and run them through an instruction decoder that then hands RISC instructions to the actual CPU. Yes, this does incur some overhead, but it frees up CPU design quite a bit by not being so closely tied to backwards compatibility.

    The fact that modern x86 chips ultimately are actually executing code as reduced instruction sets shows you don't understand the concept.
  • Wilco1 - Wednesday, February 20, 2019 - link

    x86 is still a CISC ISA irrespective of how it executes instructions. Note that compilers predominantly use the simpler instructions rather than the microcoded ones, and that's why it's possible for x86 to be fast at all.
  • Santoval - Thursday, February 21, 2019 - link

    "Both Intel and AMD have been making chips that take the CISC instructions and run them through an instruction decoder that then hands RISC instructions to the actual cpu."
    The instruction decoder is also part of an "actual CPU". Besides the decoder, the front-end also has instruction fetch, a branch predictor, (potentially) predecode, µOP & L1 instruction caches, instruction queues, a TLB, allocation queues, etc. All these units are most certainly parts of the "actual CPU".
    I believe you rather meant "hands RISC-like instructions to the *back-end* of the CPU".
  • FunBunny2 - Thursday, February 21, 2019 - link

    "The speed advantages on paper between RISC and CISC are in theory a wash. "

    not to keep beating the dead horse 360, dated as it is, but with the hardware of the time (and IBM was the top of the heap, then) the 360/30 ran the instruction set in micro-code. allegedly the first computer to even have microcode. ran like drek compared to the all-hardware versions of the machine. the '30 real cpu was long reputed to be some DEC machine.

    "cpu design quite a bit without being so closely tied to backwards compatibility."

    lots of folks say that, but makes no sense to me. compilers target the instruction set, which only changes when Intel publishes 'extensions'. whether those instructions are executed in pure ISA hardware, or a rat running in a spinning wheel (RISC), makes no difference to the compiler writer.

    the profiling explanation for microcode over pure ISA hardware makes the most sense.
  • Wilco1 - Wednesday, February 20, 2019 - link

    The only misinformation is from you. RTL simulation is widely used in the industry and is quite accurate.

    Studies have shown CISC instructions don't do more than RISC instructions - partly because compilers avoid CISC instructions, partly because CISC instructions are slow. That's why RISC works. But I wouldn't expect you to understand this.
  • FunBunny2 - Thursday, February 21, 2019 - link

    "Studies have shown CISC instructions don't do more than RISC instructions "

    at least in the z world (and predecessors), there were/are some (I don't remember the count) of 'COBOL assist' instructions which were/are quite complex and were introduced to reduce the amount of times the COBOL coders had to 'drop down to assembler'. whether that's still true, I can't say.
  • DigitalVideoProcessor - Thursday, February 21, 2019 - link

    CISC vs. RISC is a debate about instruction decode philosophy, and it has almost zero bearing on the performance of a system. CISC machines reduce everything to RISC-like operations. Saying one does more than another in a given clock is misinformation.
  • melgross - Thursday, February 21, 2019 - link

    Those wars are long over. No modern chip is either pure CISC or RISC. Those are long gone.
  • Calin - Thursday, February 21, 2019 - link

    SPECint, SPECfp, ... are "work done" tasks - what you're referring to was "MIPS" (millions of instructions per second). That performance metric has lost its charm since internally x86 processors no longer use x86 instructions but large bundles of micro-operations that are done in parallel and can be interleaved (so two instructions that follow each other are broken into micro-operations which are reordered, and might be finalized in a different order).
  • Kevin G - Thursday, February 21, 2019 - link

    The thing is that the real distinction of CISC vs. RISC is lost in their similar implementations: pipelined OoO parallel execution engines. CISC encoding may* permit more operations to be contained within a single instruction, but at the cost of having to decode that instruction into an optimal arrangement for the given hardware. The price paid is in power consumption and complexity, which may impact factors like maximum clock speed. In the era of many cores and power limitations, these attributes are the foundation for RISC to have an edge over legacy CISC designs. Not to say that RISC architectures can't leverage instruction decoding either: expanding out the fields for registers to account for the larger rename register space is a simple procedure.

    Once chips begin parallel execution, the CISC advantage of doing more per instruction really starts to fall apart. The raw amount of work being done per cycle approaches the common limit of just how much parallelism can be extracted by an inherently serial stream of instructions. Arguably CISC designs can hit this sooner in terms of raw instruction count as the instruction stream is _effectively_ compressed compared to RISC.

    *The concept of fused-multiply add instructions was an early staple of RISC architectures. Technically it goes against the purest ideal but traditional RISC designs permitted the number of operands in their instruction formatting to pull this off so they took advantage of an easy performance boost. x86 didn't gain this capability until AVX2 a few years ago.
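
    For reference, a minimal sketch of such a fused multiply-add in x86 intrinsics (assumes FMA hardware and -mfma, which arrived in the Haswell/AVX2 era; AArch64 has had vector fmla from the start):

        #include <immintrin.h>

        /* y[0..7] = a * x[0..7] + y[0..7], one fused multiply-add per lane. */
        void axpy8(float *y, const float *x, float a) {
            __m256 va = _mm256_set1_ps(a);
            __m256 vx = _mm256_loadu_ps(x);
            __m256 vy = _mm256_loadu_ps(y);
            _mm256_storeu_ps(y, _mm256_fmadd_ps(va, vx, vy));
        }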
  • peevee - Tuesday, February 26, 2019 - link

    "I think you are forgetting the very nature of RISC (Arm) vs CISC (x86) architectures"

    This distinction does not exist in practice for decades.
  • wumpus - Wednesday, February 20, 2019 - link

    It also shows a result showing Zen roughly half the performance of Intel, something that implies a fairly contrived situation. The FX-8350 might have had half (or worse) the performance of Intel, but Zen is another story.

    I'm guessing that this involves AVX256 (or higher) code specifically optimized for Intel (note that going to AVX512 is only a modest increase, since the clock rate is brutally lowered to compensate for the increased power load; also note that Zen 2 (EPYC 2 and Ryzen 3000) will include native 256-bit AVX execution paths).
  • Andrei Frumusanu - Wednesday, February 20, 2019 - link

    > It also shows a result showing Zen roughly half the performance of Intel

    The W-3175X was at 4.5GHz with the whole 38MB of L3 for the one thread, while the 7601 ran at a peak of 3.2GHz.
  • Meteor2 - Wednesday, February 20, 2019 - link

    I wish you’d normalised for frequency!
  • Andrei Frumusanu - Wednesday, February 20, 2019 - link

    That's not the point of the article.
  • ZolaIII - Wednesday, February 20, 2019 - link

    Next time read twice before posting. AVX on an integer benchmark, really?
  • Wilco1 - Wednesday, February 20, 2019 - link

    Of course. Never heard of how SIMD hugely affects libquantum for example?
  • Andrei Frumusanu - Wednesday, February 20, 2019 - link

    AVX works on integer ...
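
    For example, a trivial sketch using AVX2's integer intrinsics (assumes AVX2 hardware, -mavx2, and that n is a multiple of 8; vectorised loops like this are roughly what lifts an "integer" test such as libquantum):

        #include <immintrin.h>
        #include <stdint.h>

        /* Adds two int32 arrays eight lanes at a time with AVX2 integer ops. */
        void add_arrays(int32_t *dst, const int32_t *a, const int32_t *b, int n) {
            for (int i = 0; i < n; i += 8) {
                __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
                __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
                _mm256_storeu_si256((__m256i *)(dst + i), _mm256_add_epi32(va, vb));
            }
        }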
  • ZolaIII - Wednesday, February 20, 2019 - link

    The era of general purpose cores being used for HPC is long gone. While general purpose cores are here to stay, they will do so with a modest number of cores per system; the real push is towards special purpose and multi-purpose accelerators, with FPGAs in the first row because of their reprogrammable nature. ARM actually has an edge over CISC (x86) because it's simply more efficient while having stellar integer performance for the size of the core. If you look at the development board it's very clear ARM is pushing in the right direction.
  • Meteor2 - Wednesday, February 20, 2019 - link

    Kind of. But bottom line is the 20-odd codes used predominantly in the world still run best on general purpose CPUs. Bending software to work on specialised architectures is really hard.
  • ZolaIII - Thursday, February 21, 2019 - link

    On the FPGA you bend hardware. That's the whole idea.
  • wumpus - Thursday, February 21, 2019 - link

    HPC traditionally meant double precision FLOPS. AI work or similar might want FPGAs until GPUs are sufficiently ready for such things (then FPGA can't keep up).

    FPGAs are painfully slow at what they do, but can take an entirely new architecture on the fly. We saw that with cryptomining as things went CPU->GPU->FPGA->ASIC. And if you need a lot of multiply-accumulate (like most AI), don't expect anything between GPU and ASIC.
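
    The multiply-accumulate primitive in question is nothing more exotic than this (illustrative C; a vectorising compiler turns it into FMA instructions, and GPUs/AI ASICs are built to do it massively in parallel):

        /* Plain dot product: one multiply-accumulate per element. */
        float dot(const float *a, const float *b, int n) {
            float acc = 0.0f;
            for (int i = 0; i < n; i++)
                acc += a[i] * b[i];
            return acc;
        }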
  • surt - Thursday, February 21, 2019 - link

    That raw power comes at a .... power cost. And as soon as you try to start z-stacking your cpus that power is going to be the most important factor.
  • peevee - Tuesday, February 26, 2019 - link

    "The future is way more related to modularity than the chip architecture."

    Debatable. Both ARM and x64 are essentially the same in terms of efficiency if the same levels of performance are required. A breakthrough can only come from in-memory computing, which neither ARM nor x64 can sustain for many reasons.
  • rahvin - Thursday, February 21, 2019 - link

    ARM is not "more efficient at every level". That's just plain fanboi BS. The architecture is the least important aspect of any processor these days.

    ARM processors were traditionally designed for power efficiency above all else. Now that Intel is designing down for efficiency and ARM is designing up for performance, there will likely be some real competition, but so far ARM has not demonstrated that they can provide equivalent performance for the same power budget at the high end, and Intel has had difficulty matching the lower power budget and performance on the low end (though this is likely because they wish to avoid cannibalizing higher-end products with performant low-power versions).

    As ARM tries to enter the server market we'll finally see if they can provide something equivalent, but it's not been a hopeful showing so far, given that all but one ARM server design has been canceled, and what remains isn't equivalent to an x86 server processor of the same class in either power or performance.
  • Wilco1 - Thursday, February 21, 2019 - link

    Today you can buy Arm-based servers like the Opteron A1100, Centriq, ThunderX, ThunderX2, eMAG and HiSilicon. The first Arm supercomputer entered the TOP500 list recently, and Fujitsu has prototypes of their Post-K computer. You can buy Arm compute time from several cloud vendors today, including AWS. That all adds up to one Arm server in your book?
  • rahvin - Thursday, February 21, 2019 - link

    ThunderX is gone, displaced by the ThunderX2, which is the Centriq processor after it was abandoned by its creator. eMAG, the A1100 and the HiSilicon, last I saw, are all canceled.

    Commercially you can buy one ARM server, the ThunderX2. Go ahead, TRY to buy one.
  • Wilco1 - Thursday, February 21, 2019 - link

    How could you be so clueless? ThunderX2 is based on Vulcan, made by Broadcom - no relation to Centriq at all. ThunderX is still being used and sold. Centriq is still being sold; a few months ago Gigabyte announced a brand new motherboard for it. eMAG has just been announced. HiSilicon/Huawei has 2 generations of Arm servers already and is working on several more - that's the only one that isn't for sale outside of China, according to AnandTech.

    What's next? Are you going to tell us that Arm servers did not beat Xeon and Skylake in various benchmarks, even though the evidence was published in an article on AnandTech?
  • rahvin - Thursday, February 21, 2019 - link

    You're right, I confused Vulcan and Centriq. The Centriq is dead, the design team is gone, and there is no plan to even spin the silicon from what I've seen. Qualcomm abandoned the product under threat from an activist investor. Yes, there was a motherboard at CES, but that doesn't mean anything at all and there is literally no way to buy one.

    ThunderX is deprecated (show me where you can buy one - they deprecated the silicon over a year ago, and there may still be some inventory out there but I seriously doubt it). ThunderX2 is available, and from everything I saw it's awful. The best use case was as an nginx master server, because the compute capacity was so awful. Basically you need a workload with a lot of threads and no actual work to even make it worth anything at all, especially considering the price.

    The Huawei junk is a nonstarter; you can't buy it anywhere but China that I've seen, and it's not exactly flying off the shelves either. I've seen more ARM servers announced and canceled a year later than any that made it off the shelves into an actual product. So there is an eMAG - that's great, show me where you can buy it.

    That's my point: you can't buy them, other than the ThunderX2, or the Huawei if you want to go to China to get it. The ARM server has been a flash in the pan and I have no doubt it will continue to be so.
  • FunBunny2 - Wednesday, February 20, 2019 - link

    one has to wonder: given the existence of C compilers for any ISA, and thus *nix OS for said ISA, when (or already?) will the maths dictate both the 'optimal' ISA and underlying microarch? both, after all, are just maths optimization problems. to some delta, there is a unique solution.
  • zmatt - Wednesday, February 20, 2019 - link

    Barring any major design flaws there shouldn't really be a difference in theoretical performance between ISAs. It's important to note that the ISA isn't the actual logic of the chip; it's better thought of as a paper standard a given chip needs to conform to if it wants to be binary compatible. The real determinant of performance is the microarchitecture. People conflate this with the ISA a lot because they are both "architectures", but the microarchitecture is what describes the actual logic design of the circuit. That is what Intel and AMD apply codenames to. So things like Skylake, Thunderbird, and Cortex-A53 are microarchitectures.
  • Wilco1 - Wednesday, February 20, 2019 - link

    There certainly are differences between ISAs which cannot be overcome with micro-architecture no matter how much money, power or transistors you throw at it. Given equal resources, the best possible implementations of various ISAs will exhibit major performance differences.
  • blu42 - Thursday, February 21, 2019 - link

    Yep. While ISA may not matter as an aggregate over the set of all tasks, ISAs matter very much when it comes to the performance of any individual task, just the same way as ASICs matter versus gen-purpose CPUs for any given task. One can think of ASICs as an extreme-case specialization of gen-purpose ISAs.
  • Meteor2 - Wednesday, February 20, 2019 - link

    Indeed, and reality is that all architectures are converging in terms of performance. It’s just a question of how much money any given manufacturer wants to invest. Intel cut R&D and the results are plain. AMD invested wisely. What Apple has achieved with the ARM ISA is phenomenal. Goodness knows what they could do if they turned their attention away from mobile but goodness knows how much it cost, too.
  • Vitor - Wednesday, February 20, 2019 - link

    Although the article is about servers and such, I can't help thinking that in less than a decade RISC CPUs could overtake the desktop/notebook market.

    And, correct me if I'm wrong, RISC is inherently more efficient than x86 derivatives.
  • SarahKerrigan - Wednesday, February 20, 2019 - link

    The evidence for "inherently more efficient" is pretty shaky, although I'd venture that validation of ARM cores is considerably simpler than validation of x86.

    That being said, ARM has been delivering rapidly and consistently on uarch, and Intel has not.
  • hMunster - Wednesday, February 20, 2019 - link

    ARM is playing catch-up to Intel which got to the point of "no more low hanging fruit" much earlier.
  • Wilco1 - Wednesday, February 20, 2019 - link

    Well as an example Intel was unable to design competitive SoCs for the mobile market despite having a process advantage, investing $10+ Billion and even paying various companies to use their chips - "contra-revenue". There is no doubt the complexity of x86 translates into a significant overhead in design and verification, area, power and (at the low end) performance.
  • hMunster - Wednesday, February 20, 2019 - link

    The RISC vs. CISC debate does not really matter much anymore.
  • HStewart - Wednesday, February 20, 2019 - link

    A lot of this is because CISC processors can now handle multiple micro-instructions per clock cycle, taking away RISC's advantage of smaller instructions.

    But software compatibility is the major concern with this, and Microsoft has had many failed attempts at trying to change this dependency.
  • FunBunny2 - Wednesday, February 20, 2019 - link

    "A lot of this is because CISC process can now handle multiple microinstructions per clock cycle taking advantage of RISC smaller instruction away."

    that's a testable assertion. not by me, however. the execution of multiple microinstructions by CISC ISA machines doesn't mean, ceteris paribus, that the overlying CISC instruction runs as efficiently as a native RISC instruction; it just must run through the microinstructions. to the extent that CISC ISAs are really executed as some RISC machine on the silicon, that doesn't mean, apples to apples, that said CISC machine executes as efficiently as a native RISC machine. (native RISC does make headaches for the compiler writer, no doubt.) I'd wager that the real reason for RISC microarch was the desire to continue with X86 object code with a bit more performance back when the transistor budget began expanding, but not enough to build the entire ISA in silicon. and to keep the compiler writer from having to continually update as the real ISA (RISC) keeps changing. die shots of current cpu show that the 'core' is a diminishing percent of the real estate.

    the still unanswered question: why did Intel/AMD not use the exploding transistor budget to execute the entire instruction set in hardware, but to create these behind-the-scenes RISC machines?
  • wumpus - Wednesday, February 20, 2019 - link

    From memory, DEC was able to make the VAX four times faster by pipelining the microcode from VAX instructions, compared to "executing the VAX instruction all at once". The VAX was about the CISCiest CISC that ever CISCed (and sold successfully; I think Intel's BiiN was worse).

    DEC also made the Alpha, where even the first iteration was another 4 times faster than the "pipelined microcode" VAX.

    And this was all single issue. Don't even think of trying to issue multiple "full CISC" instructions at once.
  • lightningz71 - Thursday, February 21, 2019 - link

    This is one I can answer. My computer engineering professors fielded this exact question. Essentially, when profiling code that was being used in modern software, the major CPU vendors realized that a small portion of the x86 instructions were rarely used. So rarely, in fact, that it was an absolute waste of silicon to try to implement them in hardware, as they would be so rarely used. Add in that a lot of those instructions are not executed in isolation, but have some sort of dependency on fetching a piece of data, or on the resolution of multiple intermediary steps during their execution, so going with full hardware implementations would not have resulted in a major boost to their performance. Instead, they elected to implement them in microcode and execute them on the highly tuned circuits used to implement the more common instructions in the back end. So, while you lose some performance having to load and run the microcode sequences, the core is actually executing those simplified sub-instructions very rapidly, and can do other things while waiting for various tasks to complete.

    So, while there is a case to be made that a full, tuned and optimized hardware implementation of the more complex instructions could be done and would perform more quickly than the microcode sequences, the actual speedup for the overall performance of the systems in question would be minimal because of how rarely those instructions are used in practice. You're talking about shaving off a few tens of cycles per instance on a processor that is running at around 4GHz these days. The real performance impact would be minimal, but the development cost and circuit budget consumed would be significant for not much gain.
  • FunBunny2 - Thursday, February 21, 2019 - link

    "Essentially, when profiling code that was being used in modern software, the major CPU vendors realized that a small portion of the x86 instructions were rarely used. "

    not to do too much what-about-ism, but IBM was doing that with COBOL applications, in real time monitoring (allowance to do so was embedded in the lease agreement), at least as early as the 360.

    naturally, I didn't remember that lower brain stem memory until reading your comment. my shame. (:

    but... I do wonder about all those 'extensions' to the original 8086 instruction set. weren't they created to support 'necessary' functions? here: https://en.wikichip.org/wiki/x86/extensions

    or are they, too, not used enough?
  • Wilco1 - Thursday, February 21, 2019 - link

    Well when did you last use MMX? Or x87 floating point? There are large numbers of instructions which are hardly ever used.
  • FunBunny2 - Thursday, February 21, 2019 - link

    HLL coders don't, at least directly. but I'm old enough to remember when adding a '87 (before FP was moved to the '86) put a rocket under 1-2-3.
  • Wilco1 - Thursday, February 21, 2019 - link

    The point is both have been superseded by all the SSE variants, which themselves are now being replaced by AVX. Intel has posted patches to change HLL MMX intrinsics to use SSE instructions instead of MMX.
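
    A rough before/after of that kind of rewrite - an MMX intrinsic vs its SSE2 equivalent; a sketch only, the actual patches cover far more than this:

        #include <mmintrin.h>    /* legacy MMX */
        #include <emmintrin.h>   /* SSE2       */

        /* The MMX form works on 64-bit registers that alias the x87 state;
           the SSE2 form does twice the lanes in an XMM register and avoids
           the x87/emms juggling. */
        __m64   add_i16_mmx (__m64 a,   __m64 b)   { return _mm_add_pi16(a, b);  } /* 4 x int16 */
        __m128i add_i16_sse2(__m128i a, __m128i b) { return _mm_add_epi16(a, b); } /* 8 x int16 */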
  • zmatt - Wednesday, February 27, 2019 - link

    Usually you don't invoke those yourself. The compiler does.
  • nevcairiel - Wednesday, February 20, 2019 - link

    The desktop and notebook market will face adoption problems simply from having your software run (fast). Of course they can use emulation layers, but that once again costs you efficiency/performance.

    Mobile was an entirely new space, so no pre-existing software to really worry about, and servers are a far more managed space so that software is often more readily available in the variants you need. Desktop usages on the other hand are full of legacy software that has to work.
  • ZolaIII - Wednesday, February 20, 2019 - link

    At its core (the base integer instruction set) it is more efficient, but that doesn't mean much nowadays. The main factor is the design of the actual core as such.
  • ballsystemlord - Wednesday, February 20, 2019 - link

    But, and here's the kicker, the binary nature of proprietary SW means that switching arches will require many fixes to programs, and many more will never be ported. Emulation, which is slow for CPU arches, is the only way that such SW could continue to exist.
    Gee, Stallman was right!
  • wumpus - Thursday, February 21, 2019 - link

    Put it this way: the effective means to convert a "CISC" architecture to internally* "RISCY" operation could be included on a CPU core effectively in the mid 1990s. This pipeline step is sufficiently small to make no difference nowadays (although Sandy Bridge and later use caches to store pre-decoded micro-ops). The RISC/CISC wars died a long time ago, and now we only have Intel vs. ARM vs. AMD (and don't forget IBM).

    * (Internally RISC). Oddly enough, the more "internally RISCy" a 1990s-era chip was the less successful it was. The AMD K5 was internally a 29k derivative (a real RISC) and failed miserably. Supposedly IBM had a PowerPC/X86 hybrid that never made it out of the lab. Transmeta did its translation in software, but fell into the "single device power trap". Nextgen was probably more successful than all of these (especially in convincing AMD to buy them and producing the mighty Athlon), and had the ability to execute native code (supposedly. I don't think anyone ever did. Presumably involved 80 bit instructions). Pentium Pro, K6, Pentiums 2&3, Athlon all executed "native microcodes" but don't appear to slavishly copy RISC dogma.
  • wumpus - Thursday, February 21, 2019 - link

    This was supposed to be a reply to "X86 is less efficient than RISC".
  • Neutral - Wednesday, February 20, 2019 - link

    "All of these new microarchitectures are important to Arm because they represent an infliction point in the market"...
    Dictionary result for infliction
    /inˈflikSHən/
    noun
    the action of inflicting something unpleasant or painful on someone or something.
  • Ryan Smith - Wednesday, February 20, 2019 - link

    Whoops, there's one heck of a typo. Thanks!
  • phoenix_rizzen - Thursday, February 21, 2019 - link

    There's several other typos, incorrect word usage, and "correctly spelled but incorrect word" errors in this article.

    Excellent article with a *lot* of great information. But a lot of niggles that should have been picked up by an editor before it was published.
  • GreenReaper - Wednesday, February 20, 2019 - link

    They're bringing *pain* to the competition! :-D
  • Antony Newman - Wednesday, February 20, 2019 - link

    Ryan : I think I counted several (6?) typos - a spellcheck will pick up most of them (there was also a DDR related one with a missing letter).

    AJ
  • eastcoast_pete - Wednesday, February 20, 2019 - link

    @Andrei: Thanks! Sounds very promising, I look forward to ARM-based solutions giving Intel (and AMD) some competitive pressure in the server space, now that SPARC and Power are either down or almost out of this market. Question: Any mention of Microsoft server-type applications being ported to run native on ARM64/Neoverse? That would open a large market for Neoverse and following designs.

    Other comments: So, did Qualcomm get out of the server market exactly at the wrong time? Sure looks that way. Huawei played it smart, using its development and know-how from the mobile space to make itself a serious contender for ARM-based servers (China's push to have non-US solutions for their home markets doesn't hurt them, either).
    The big technical questions for me are: What will the performance and energy use of the CMN-600 mesh network and CCIX be? AMD is trying its chiplet approach, but their first generation had high energy consumption and some performance degradation by the interconnect, which still has them at a disadvantage to Intel's all cores on one die approach. However, scaling up becomes prohibitive as the die gets bigger and transistor counts get higher. I see the interconnect tech as the next big thing for servers and HPC. If they (ARM) and partners can pull this off and be the first with a high-performance, energy-efficient interconnect, they can clobber Intel or AMD just by combining more and more cheap, high-performing cores using chiplets, and do so at a much lower per-core and per-performance cost.
  • SarahKerrigan - Wednesday, February 20, 2019 - link

    Power is not even remotely out of the server market. IBM sells billions of dollars of servers per year, and the top two systems on Top500 are Power.
  • eastcoast_pete - Wednesday, February 20, 2019 - link

    Didn't say that Power is completely out of the server market; SPARC is, thanks to Larry. However, while Power is big in HPC and Supercomputing, it's become rare these days to find a Power-based unit in a server closet or rack for more general use, especially outside of large enterprises. IBM also sells quite a number of Intel-based servers, and I believe that accounts for a sizeable chunk of their remaining server business.
  • SarahKerrigan - Wednesday, February 20, 2019 - link

    That's not accurate. IBM offloaded their x86 server business (to Lenovo) in 2014. At this point, it is just z mainframes and Power.

    My company owns a pair of Power systems, one single-socket and one two-socket, and we're pretty small. It's not just large enterprises. OpenPower has brought down entry costs significantly.
  • eastcoast_pete - Thursday, February 21, 2019 - link

    I stand corrected on IBM still selling x86 servers (they don't), and am actually glad to hear that Power is also used in smaller shops. It's just that I haven't run into too many Power systems around here. It's a very capable arch.
  • Kevin G - Thursday, February 21, 2019 - link

    The big Power users are also the big cloud providers. Google and Amazon have reportedly taken a liking to OpenPower hardware, and Facebook is reportedly looking into OpenPower as well. Granted, these are small scale compared to the number of x86 systems these companies have, but it was a much needed shot of energy for the Power platform.
  • nevcairiel - Wednesday, February 20, 2019 - link

    Microsoft already ported Windows Server to ARM, and their entire development stack has support for ARM and ARM64 now, so it's only a matter of time until the other server products are made available.
  • HStewart - Wednesday, February 20, 2019 - link

    It's really funny that Microsoft did not trust that environment enough to create a Surface using an ARM processor.
  • GreenReaper - Wednesday, February 20, 2019 - link

    Uh . . . that's reportedly because Intel came begging them *not* to for the Surface Go 2018 (and probably cut them a very nice deal on the Pentium Gold as a result): https://www.techradar.com/uk/news/microsoft-surfac...

    As mentioned, you can also compile for 64-bit ARM in VS now. This is a major win for some apps which truly require native execution (which is not all of them, but enough to be a pain):
    https://blogs.windows.com/buildingapps/2018/11/15/...

    Will it actually become a viable platform as a result of all this? I suspect it still won't be the default in five years, but in cost-conscious areas it could end up with a foothold. Even if Microsoft doesn't go down that route, it may be open for others to do so for specific purposes, such as education.
  • eastcoast_pete - Wednesday, February 20, 2019 - link

    Not funny; rather, cautious. None of the A76 designs were in silicon when MS designed the current Surface. When you spec out a design like the Surface, you base it on what's available at that time, not on what might be around next year. Otherwise, the chance of ending up with egg on one's face is uncomfortably high.
  • eastcoast_pete - Wednesday, February 20, 2019 - link

    I agree with you, but, as we all know, businesses buy the hardware that can run the software they want or need, not the other way around. In this regard, I am curious if Oracle and SAP are porting their offerings to ARM64 server. If both of those are on board, this design would have a great chance to get strong traction.
  • HStewart - Wednesday, February 20, 2019 - link

    One thing that concerns me in this article is that this chip is marked as simulated in the charts, which to me is just a marketing term. It is also compared against existing 2018 designs from both Intel and AMD; an actually fair comparison would be against a Sunny Cove-based CPU with more units and such.

    I also think that just increasing cores is not the best way to handle performance. In today's world single-core performance is still very important - this depends on the market the chip is intended for - but the important part is software compatibility.
  • Antony Newman - Wednesday, February 20, 2019 - link

    H.

    A Simulated vs Historical point was made in the article. Perhaps you need to reread?

    Also : Single Core performance is very important - especially when they are all running flat out.
    Intel has to throttle down their multi core beasts so the chips don’t catch fire at 14nm.
    At 10nm - Intel will be able to sustain a few more cores before throttling.
    And before Intel is at (Intel) 7nm, ARM will likely overtake Intel on the IPC front (assuming that ARM's prediction is as accurate as my own).

    AJ
  • eastcoast_pete - Wednesday, February 20, 2019 - link

    Single core is still important for client computers, but much less so for servers.
  • Antony Newman - Thursday, February 21, 2019 - link

    (Arbitrary example)

    If an SoC can run a single core at 5GHz, but throttles down to 2.5GHz when 16 cores are active - then it cannot scale (due to the TDP limit).

    If ARM are designing their CPUs so that 128 (i.e. all) of them can run flat out without throttling, then ARM's single-core performance is indicative of the overall performance.

    If ARM increase their single-core performance by 1.7 times in two years - and keep this same MO of not throttling in order to stay within the TDP - it will be more than just data centres that want to buy into this new architecture.

    AJ
  • wumpus - Thursday, February 21, 2019 - link

    Very few problems scale without penalty. Having high single-core performance (for each core in a multi-chip server CPU, obviously - the Intel result using all of its cache on one core is obviously irrelevant, and is why it was so anomalous vs. AMD) means far fewer cores are needed when scaling up. Also, adding more and more cores requires as much cache or more; if not, your bandwidth will scale even worse.

    Single core is absolutely critical for servers, and is why it is taking ARM so long to break in. IBM is the exception that proves the rule, but they rely on weird licensing rules and on making sure all the threads can access the same cache.
  • eastcoast_pete - Thursday, February 21, 2019 - link

    I actually think we are in agreement. While this borders on semantics, per-core performance is, of course, very important for servers, while high single (one) core performance is not. As you point out, Intel getting really high one-core performance from an 18-core Xeon by running a strictly single core/thread test while allocating all the cache and much of the thermal envelope to that one core is an artificial situation for a server.
  • The_Assimilator - Wednesday, February 20, 2019 - link

    Remember when "system on chip" meant IO too? Apparently Arm doesn't.

    Remember when Arm chips didn't need HSFs to run? Pepperidge Farm remembers.

    I'm going to enjoy it when this, like all of Arm's previous attempts at the high-end, fails once again. Or when Lakefield eats Arm's lunch, whichever comes first.
  • wumpus - Wednesday, February 20, 2019 - link

    When your volume is 1400 chips (not all the same design) over 4 years, you use FPGAs for anything you can. Doing anything else is pretty dumb. I'm surprised they bothered with an actual layout, but I suspect that they've been bitten by tiny details in FPGA simulation that never quite worked the same at speed.

    HSF? You want the MIPS, you burn the Watts. Presumably this is the "tell" in your troll.

    When has ARM made a previous attempt at the high end? Certainly more than a few of their architectural licensees have, but there's a huge difference between a server architecture backed by ARM and even one backed by Qualcomm. For one thing, they pretty much need to standardize remote administration to Intel levels (possibly circa ~2008ish) to get off the ground. That's a lot of pesky little details, but something they absolutely need standardized to allow server use in the datacenter (yes, the Big Boys can roll their own, but everybody else needs a common server definition).
  • Antony Newman - Wednesday, February 20, 2019 - link

    Fascinating article.

    Do you think Ampere, Huawei, Cavium and Amazon will all switch to the Neoverse?

    In terms of IPC - do you have a view on whether ARM has caught up with Apple's Vortex yet?

    Is there any reason why a mobile phone (or tablet) maker wouldn't use the ARM 'server' chip in a fondleslab?

    AJ
  • ballsystemlord - Wednesday, February 20, 2019 - link

    Spelling and grammar corrections:
    ...the actual real-life performance improvements will higher due other SoC-level improvements as well as software improvements that aren't available in existing actual A72 silicon products.
    Missing be:
    ...the actual real-life performance improvements will be higher due other SoC-level improvements as well as software improvements that aren't available in existing actual A72 silicon products.

    The figured weren't run actual silicon but rather estimated on Arm's server farm in an emulation environment with RTL.
    Miswritten sentence:
    The figures weren't calculated on actual silicon but rather estimated on Arm's server farm in an emulation environment with RTL.

    The E1's CPU pipeline actually represents a brand new-design which (besides the A65) haven't seen employed before.
    Missing we:
    The E1's CPU pipeline actually represents a brand new-design which (besides the A65) we haven't seen employed before.

    Here we have to clusters of 8 cores in a small CMN-600 2x4 mesh network, ...
    Wrong 2:
    Here we have two clusters of 8 cores in a small CMN-600 2x4 mesh network, ...

    I was half asleep when I read it so there might be more.
  • sohntech43 - Wednesday, February 20, 2019 - link

    Could someone help me understand why the SPEC CPU2006 results are so different from those recorded for the AMD 7601 (1000 - 1200 vs. 690.63) and the Xeon Platinum (1300+ vs. 730) in the SPEC database?

    https://www.spec.org/cpu2006/results/cpu2006.html

    They are also different from what AMD was boasting at the time of the original EPYC launch:

    https://www.microway.com/download/whitepaper/AMD-E...

    I'm probably missing something obvious...
  • Wilco1 - Wednesday, February 20, 2019 - link

    Yes, you're missing the fact that these are GCC8 scores using -Ofast, as mentioned in the article - i.e. like when you build code yourself.

    Official SPEC scores are quite different and use special trick compilers to get the highest score. For example libquantum shows a completely unrealistic result in most SPEC submissions which artificially inflates the integer score by 30+%.
  • sohntech43 - Wednesday, February 20, 2019 - link

    Thanks - was surprised by the sheer magnitude of the delta caused by the compilers. Impressive results for N1 and will be interesting to see when silicon is available.
  • platinumjsi - Thursday, February 21, 2019 - link

    Wouldn't mind seeing an AnandTech article detailing the differences between x86 and ARM and explaining the benefits and downsides of each architecture.
    Looking at this, if ARM can do 64 cores for 105W, I have to wonder why it takes AMD 250W on the same process?
    Intel / AMD should be seriously worried about this.

    Could we see ARM-based laptops / desktops? They wouldn't be any good for gaming, as I can't see devs recompiling their back catalogue for ARM, but for office use these seem ideal.
  • SarahKerrigan - Thursday, February 21, 2019 - link

    ARM laptops already exist, running Windows. Look at the Asus NovaGo or the Lenovo Yoga C630.
  • edzieba - Thursday, February 21, 2019 - link

    ARM still has to overcome the same issue that killed off every other HPC architecture (barring POWER just about hanging in there): not being x86 ( http://3s81si1s5ygj3mzby34dq6qf-wpengine.netdna-ss... ).
  • Antony Newman - Thursday, February 21, 2019 - link

    Perhaps this will all change when Apple moves over to the ARM ISA? Apple would have to offer a system that is at least as capable as x86 with a significantly lower TCO. I think the limitation up until now was the wimpiness of the ARM offerings; they were cheap but not performant.
  • javadesigner - Friday, February 22, 2019 - link

    What does the "A" in ARM stand for ? I always thought it was Apple. Does Apple make any licensing money when Samsung makes an (A)pple(R)(M) chip ?
  • TobiWahn_Kenobi - Saturday, February 23, 2019 - link

    Acorn RISC Machines
  • TobiWahn_Kenobi - Saturday, February 23, 2019 - link

    Nowadays Advanced RISC Machines.
  • darkich - Sunday, February 24, 2019 - link

    .. Isn't this basically (as expected) an expansion of the Cortex A76 architecture?
  • techbug - Sunday, June 9, 2019 - link

    The Intel W-3175X's rate score doesn't look right. Its single-thread score is 48.18 and its 56T score is 729.82, so each thread contributes only 729.82/56 = 13.03, which seems too low.
  • obi210 - Sunday, February 2, 2020 - link

    A Neoverse E1 derivative, without SMT, could be the next gen Arm little core. Such a core with Armv9 support could provide near A73 performance. In-order A53 derivatives should not be carried over to next gen.
