
  • witeken - Friday, December 16, 2016 - link

    "and it is naturally going to capitalize on the fact that it takes two Intel multi-core CPUs to offer the same amount of physical cores."

    These chips will compete against Skylake-EP, which will be launched mid-17, which will have up to 32 cores, so at best Qualcomm has 1.5x as many. But core count on its own is just as worthless as frequency. Performance also depends on the architecture.
  • prisonerX - Friday, December 16, 2016 - link

    Actually you're entirely wrong. Because maximum frequencies have long since plateaued and there is an upper limit on how much heat, i.e. power, you can dissipate per square mm of silicon, going forward core count and the attendant low-power architectures will win. Have a look at graphics cards for a glimpse of the future.

    You're right when you say "Performance also depends on the architecture" but you have the wrong performance in mind. Power consumption is the key performance metric because it determines density and therefore overall throughput. Single thread performance as an absolute measure is meaningless in this scenario, it's all relative to how much space and power it takes to deliver it.

    Intel's legacy architecture can't compete in this respect. You'll note that frequency decreases in Intel processors as core count increases. Intel lost on mobile because they can't compete on low, low power, and they'll also lose in the high-core-count future because they've wedded themselves to their old and inefficient architecture.
  • witeken - Friday, December 16, 2016 - link

    "Intel's legacy architecture can't compete in this respect. You'll note that frequency decreases in Intel processors are core count increases. Intel lost on mobile becuase they can't compete on low, low power, and they'll also lose in the high core count future becuase they've wedded themselves to their old and inefficient architecture."

    I recently read a quote that people who are *sure* someone else is wrong are usually wrong themselves. And that is entirely the case here. ISA has little to zero impact. Process technology is orders of magnitude more important.

    Please read the following article from AT: The x86 Power Myth Busted: In-Depth Clover Trail Power Analysis: http://www.anandtech.com/show/6529/busting-the-x86...
  • Krysto - Friday, December 16, 2016 - link

    That article was SO SO wrong. I don't know whether he did it on purpose (Anand was showing himself to be a pretty big Intel fanboi at the time) or whether he simply missed it, but it's strange that he would've missed something so obvious.

    Here's the thing: Anand compared a 32nm planar process for the ARM chips with Intel's 22nm 3D Tri-Gate process. Not only was 22nm vs 32nm an entire process generation ahead on its own, but so was Tri-Gate vs planar (Intel essentially jumped two process generations' worth of performance gains when it moved from 32nm planar to 22nm Tri-Gate).

    So Intel was not one, but TWO process generations ahead for Atom compared to the equivalent ARM chips, and yet it could still BARELY compete with those ARM chips - yet Anand's conclusion was "x86 myth busted" ?!

    If anything it proves the myth was NEVER busted if the ARM chips could still hold their own despite being two process generations behind.

    Another very important point: that Atom chip was still significantly more expensive than those ARM competitors. It's why Intel had to lose $1 billion a quarter to keep subsidizing it and make it remotely attractive to mobile device makers, and why it eventually licensed the Atom design to Rockchip and Spreadtrum, thinking they could build it cheaper. But it was always going to be a losing strategy, because by the time those Rockchip Atom chips came out at 28nm planar, ARM chips were already at 20nm. And as explained above, Atom was only competitive when it was two process nodes ahead. So Intel never had a chance with Atom.
  • witeken - Friday, December 16, 2016 - link

    Lol, have you actually looked at the article? As of 2012, Intel did not have mobile FinFET devices, which makes your entire rant moot.

    And if I may add another truism: people who invent facts, can argue for anything (and it doesn't matter if this is because of ignorance, i.e. not having enough information). So I will simply ignore all your falsities.
  • witeken - Friday, December 16, 2016 - link

    Anyway. I would say the burden of proof is on your side. Please scientifically explain why ARM is more power efficient. If you can't, that could be because you are wrong.

    The world does not run on magic.
  • name99 - Saturday, December 17, 2016 - link

    The issues are not the technical ones that are being thrown around. The issues that matter are

    (a) design and validation time. We know it took 7 years to design the Nehalem class of CPUs. This appears to have stretched out to eight years with Kaby Lake. Meanwhile we know it took Samsung 3 years to design their fully custom Exynos, so Apple and QC are probably on a similar (perhaps four year) schedule.
    Obviously a server is more complicated (look how Intel servers ship two years after their base cores) and we don't know how much of that extra validation would affect QC. But everything points to QC and others having an easier job, and being able to respond faster to market changes and possible new design ideas.

    (b) prices. Intel finances its operation by extracting the most money from its most captive buyers. Once QC and others ship chips that are at an acceptable performance level, Intel can certainly respond with lower prices, like they have with Xeon-D; the problem is that doing so throws their whole financial structure into chaos.
  • Gondalf - Sunday, December 18, 2016 - link

    Nah my friend, a brand-new x86 CPU takes about 3 years to develop, much like an ARM CPU/SoC; nothing is different given the software tools Intel has. The extra year Intel adds in development is because Intel cores are fully validated for the server space and ready for 24/7 utilization without electromigration for at least 10 years.
    Please provide some proof before posting on AnandTech.
  • Wilco1 - Sunday, December 18, 2016 - link

    If developing new x86 CPUs is so easy, then explain how it is possible that a small company like ARM is able to release new microarchitectures much faster. There have been just 3 Atom microarchitectures in 8 years despite Intel spending $10+ billion on mobile. ARM has produced more than 10 different A-class cores since 2008 for a tiny fraction of Intel's cost.

    Please get a clue before posting on Anandtech.
  • Gondalf - Sunday, December 18, 2016 - link

    Those are only cores, not cores implemented in an SoC; the heavy cost of SoC implementation sits on the licensee's balance sheet. As for ARM's and Intel's costs for developing the architecture alone, without the hard work of implementing it in a real chip... we have no real figures, but surely they are comparable at a given level of complexity and debugging. Decoders are standard building blocks only in Intel/AMD software.
    As for Atom, it was Intel's choice to stay on a single architecture for a long time and refine the uncore instead; you know the Intel way of thinking. Pretty strange to see you making these arguments... pretty strange indeed.
  • Wilco1 - Sunday, December 18, 2016 - link

    No, wrong again. ARM does design and license everything required to make an SoC. On top of that, ARM does physical IP for lots of processes and sells pre-hardened cores that are optimized for a specific process and ready for use with minimal integration. So, except for actually making chips, that's as much as Intel does. Licensees decide how much additional work they want to do: some take ARM's cores as-is, some do fine tuning, others design their own cores.

    It's true that with mobile Atom Intel was further behind on the uncore than the core. But they were late with everything: uncore, SoC, decent GPUs, faster CPUs, radio, a mobile process, etc. Apparently, before canning it all earlier this year, they finally finished their very first SoC with an on-chip radio - not exactly a barn burner at 1.5GHz on TSMC 28nm! I told people years ago SoFIA would be beaten by faster, smaller and cheaper Cortex-A53s. So no, I don't know the way Intel thinks; pretty much every move they made didn't make sense to me. Plenty of people fell for the marketing though.
  • deltaFx2 - Sunday, December 18, 2016 - link

    @ Wilco1: Developing any CPU isn't easy; to first order, the ISA doesn't make things easier. x86 has more legacy verification, but it's small compared to the effort of verifying any CPU. The thing about ARM, though, is that ARM doesn't actually build anything. Even in its simplest avatar, ARM leaves the integration to the vendors. It's not an apples-to-apples comparison. In the same span in which it did 3 Atom cores, Intel produced Nehalem/Westmere/Sandy Bridge/Ivy Bridge/Haswell/Broadwell/Skylake/Kaby Lake. It's a question of focus, I think, more than anything else. Intel neglected the Atom line initially, only to realize late in the game that the cellphone market was getting real. But it was too late.
  • name99 - Sunday, December 18, 2016 - link

    I am not making the numbers up.
    The Samsung number comes from their talk at Hot Chips 2016. The Nehalem number comes from a talk given to Stanford EE380 soon after Nehalem was released.
  • name99 - Sunday, December 18, 2016 - link

    The slides for the talk are here. The video seems to have disappeared (it was once public).
    The slides refer to 5+ design years, but in the talk he said the time kept growing and was at around 7 years in 2010.
    http://web.stanford.edu/class/ee380/Abstracts/1002...
  • Kevin G - Monday, December 19, 2016 - link

    @name99

    Is this the video you were referring to?

    https://www.youtube.com/watch?v=BBMeplaz0HA

    That video was filmed in 2006 and was uploaded in June of 2008, both prior to Nehalem being released. It does have plenty of insights into how the industry was working during that time frame.
  • name99 - Monday, December 19, 2016 - link

    Obviously it's not the talk I was referring to! Look at the slides. The talk was given by Glenn Hinton in Feb 2010.
  • Kevin G - Monday, December 19, 2016 - link

    @name99

    Let's try these droids:

    http://www.yovisto.com/video/17687
  • name99 - Monday, December 19, 2016 - link

    Well done! Kevin G wins the internets for today!
  • chlamchowder - Sunday, December 18, 2016 - link

    In response to (b), buyers are not captive. But no other chip maker can offer competitive per-thread performance, on any architecture.

    With multithreaded performance, AMD tried with Bulldozer/Piledriver (lower price), and IBM tried with Power8 (more perf, but more heat). But Intel's still dominating servers.
  • Michael Bay - Monday, December 19, 2016 - link

    You're talking to an Intel-hating fruit fanatic, what did you expect?
  • Kevin G - Monday, December 19, 2016 - link

    (a) A couple of those years are high-level design improvements on paper. With Intel's tick-tock cadence (now process, architecture, optimize), much of what goes into those early stages is actually carried over from the previous-generation chip they're currently working on but couldn't get working or validated before the delivery deadline. This applied mainly between tocks, as the ticks were just shrinks of the tocks. Also, much like ARM, Intel has different teams working on different chip functions that get put together into the end chip. Memory controller design teams are different from GPU, which are different from CPU, etc. These may have different design cadences.

    The 800 lb gorilla in the room is that Intel owns its fabs and feeds process optimization design rules directly into its logic design teams. ARM, QC, and others are at the mercy of outside foundries for this information. Thankfully this information flow starts well before a node is ready for volume production. Raw data here is hard to come by, but it is believed that this information exchange happens a bit later in the design process than what gets fed into Intel's logic design teams.

    (b) This is what Intel really optimizes its product lineup for. Intel knows when to charge outrageous premiums (*cough* 10-core i7 *cough*) and it knows when to subsidize to gain market share, as with Atom. The problem is that this can create conflicts with its otherwise profit-maximizing lineup. This is why Xeon D is a soldered part and sold mainly to OEMs and companies doing private designs like Facebook and Google: it was purely to prevent ARM from gaining a foothold in the low-power sector of the datacenter. Intel would rather these companies purchase more expensive Xeon E5s.

    For Intel, the real chaos has always been getting into mobile. The only issue for Atom is that at the power levels needed for a phone, ARM was surprisingly competitive performance-wise. Intel's pricing of Atom wasn't bad, but it didn't offset the software costs of porting to and validating x86, plus the need to actively support both x86 and ARM handsets. Intel also failed to indicate that it was willing to keep providing chips at those prices long term. It seemed that everyone knew this was a market-share strategy and that as soon as Intel had its foothold, chip prices would climb.
  • chlamchowder - Friday, December 16, 2016 - link

    No? The article compared the Atom Z2760 and Tegra 3. The Z2760 uses Intel's now-old Saltwell architecture on 32nm. The Tegra 3 is on 40nm. Not sure how far apart those are.

    But going beyond that, what ultimately matters is who's more power efficient. If Intel's more power efficient on 22 nm than ARM is on 32 nm, Intel wins. Nobody will buy ARM just because it did an admirable job on 32 nm and didn't lose too badly to a 22 nm chip.
  • witeken - Friday, December 16, 2016 - link

    BTW, AnandTech also did a follow-up several months later. http://www.anandtech.com/show/6536/arm-vs-x86-the-...
  • beginner99 - Saturday, December 17, 2016 - link

    Yeah the Atom clearly won that and considering that this was the old crappy Atom and not the new one...
  • ddriver - Saturday, December 17, 2016 - link

    the old crappy atom vs the new crappy atom lol
  • Wilco1 - Saturday, December 17, 2016 - link

    The 2nd article compared more like-for-like CPUs but had the same flaws - both used hardware modifications done by Intel and focused on tiny JS and browsing benchmarks, so it was more a browser efficiency comparison than a CPU efficiency comparison. There are still large differences between various JIT compilers today, including between different versions of the same browser.
  • Gondalf - Sunday, December 18, 2016 - link

    What a boring discussion. There is a general consensus that the ISA doesn't add or subtract power consumption. Intel has decoders, ARM has longer code to run. In the end all ISAs have their pros and cons.
    It's all a matter of good or bad process implementation of the blueprints.
  • Wilco1 - Sunday, December 18, 2016 - link

    No, the ISA certainly matters, not only in design and validation time, but also in PPA (power, performance and area). Or is there absolutely no difference between, say, the x87 FPU and the SSE FP instruction set?

    Intel couldn't make Atom competitive despite their huge process advantage and many billions spent (so don't claim it was for lack of trying). Even the latest Goldmont is about as fast as phones were almost 2 years ago despite a 10W TDP...
  • Gondalf - Sunday, December 18, 2016 - link

    Your answer needs a good debugging :). You don't say anything to prove your claims about the supposed ARM architecture advantage, and you make the mistake of mixing up the standard Intel 14nm LP process (4GHz at 1V) with Intel's 14nm SoC process (3GHz at 1V).
    Silvermont was 1-1.5W/core on the SoC process and 3-4W/core on the plain LP process at high clock speeds.
    Come on Wilco :)
  • Wilco1 - Sunday, December 18, 2016 - link

    Hmm, you didn't answer my question about x87/SSE. Nor did you mention anything to prove your incorrect assertion that the ISA has no impact on power consumption.

    And no I didn't make any mistake about process either. There is no doubt Intel 14nm processes are better than Samsung 16nm. Yet the very latest Atom looks bad compared to an old Galaxy S6. Also you do realise there are several 4-thread Skylake SKUs which have a similar or lower TDP, right?
  • name99 - Monday, December 19, 2016 - link

    "ARM has a longer code to run."
    I assume this is supposed to mean that ARM has lower code density. Except that this is wrong.
    ARM 32-bit using Thumb2 has better code density than x86-32.
    ARMv8 Aarch64 has better code density than x86-64.
    These are both academically verified numbers.
  • beginner99 - Saturday, December 17, 2016 - link

    Krysto, you are only half right. The article compared Tegra 3 on 40nm vs Clover Trail on 32nm. So yes, it wasn't fair, but it was not planar vs Tri-Gate.

    ARM server SoCs will only be able to compete if they get close to Intel's single-thread performance. The thing is, you don't really need tons of slow cores. Fewer, faster cores are actually a lot better. Why? The cores can easily be split up by assigning more virtual cores than are physically available across all the virtual machines. Besides that, VMs make management a lot easier, especially if you are running dev, test, and prod servers. Just clone them instead of doing a complete physical reinstall...

    What also often gets forgotten isn't raw throughput but latency for the end user. If you have a complicated web service that needs some serious grunt, the CPU with faster ST performance wins. And with VMs, all the applications running on that server profit from that fast ST speed and latency advantage compared to many slow cores.

    Then there is also the per-core licensing of certain software, which also hurts slow cores. So it's a very steep battle for ARM servers.
  • serendip - Saturday, December 17, 2016 - link

    Wouldn't a lot of distributed stuff like Hadoop run better on more but slightly slower cores? Qualcomm isn't targeting typical hosting companies running VMs for the initial rollout; it's going for the big cloud providers who run their own custom software stacks. Some things respond well to having lots and lots of low-power cores thrown at them, as long as there's a lot of shared RAM and fast interconnects.
  • deltaFx2 - Saturday, December 17, 2016 - link

    If the only thing you ran was Hadoop, then that _may_ possibly be true. Most data centers would run things other than Hadoop as well, to utilize the server to near-100%. It's also important to remember that the more you parallelize, the more cost you pay in overhead. And a parallel workload is only as fast as its slowest thread, so at some point single-threaded performance will show up. And then that's Amdahl's law.
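    A minimal sketch of that Amdahl's-law point; the parallel fractions are arbitrary example values, not measurements, and 48 is just the core count from the article:

        /* Amdahl's law: speedup = 1 / ((1 - p) + p / n) for parallel fraction p on n cores. */
        #include <stdio.h>

        int main(void) {
            double p[] = { 0.90, 0.95, 0.99 };  /* assumed parallel fractions */
            int n = 48;                         /* cores */
            for (int i = 0; i < 3; i++) {
                double speedup = 1.0 / ((1.0 - p[i]) + p[i] / n);
                printf("parallel fraction %.2f -> speedup on %d cores: %.1fx\n",
                       p[i], n, speedup);
            }
            return 0;
        }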
  • Kevin G - Monday, December 19, 2016 - link

    @deltaFx2
    It depends on the data center and redundancy. Running servers at 100% is not actually advised, as that doesn't leave room for failover in application clusters. If you have two nodes in a cluster, it is advised to keep each node under 50% load so that if one dies, the other can handle the additional workload without issue.

    With turbo functionality being commonplace, this actually works out slightly better than a single server at 100% load due to the marginally higher clocks obtainable at 50% load.
  • FunBunny2 - Saturday, December 17, 2016 - link

    -- Wouldn't a lot of distributed stuff like Hadoop run better on more but slightly slower cores?

    2 points:
    a) there are only a handful of embarrassingly parallel user-space problems, so massive core counts only solve those; web servers being the most likely
    b) Intel CPUs run on RISC-like hardware internally; one might expect it to look at least as fancy as ARM's, since Intel has been building x86 CPUs that "decode" to RISC-like micro-ops for nearly two decades. Caching those translated micro-ops makes an x86 run much like (or just like) an ARM CPU.
  • patrickjp93 - Saturday, December 17, 2016 - link

    No. The overhead of broadcasting data to more nodes, of launching more threads on more cores, adds up very quickly, more quickly than Gustafson's Law and Amdahl's Law predict scaling does even in the best case. It's one major reason IBM sticks with a scale-up core design philosophy.
  • Wilco1 - Sunday, December 18, 2016 - link

    "the CPU with faster ST performance wins". Note Xeons don't have the fastest ST performance. High-end Xeons typically have a 2GHz base frequency and can just about reach 3GHz for a single thread. Note 48 cores is similar to 48-thread Xeons (which support 8 sockets), so clearly there are markets for lots and lots of medium-performance threads.

    So ARM servers only need to get close to 2GHz Xeon performance to beat high-end Xeons. And that's a much lower barrier than you suggest.
  • serendip - Sunday, December 18, 2016 - link

    My initial reasoning as well. The way Qualcomm talks about the chip, they're not concerned about single threaded performance of a few big cores, they'd rather focus on a lot of medium-performance cores. If this chip eats into Xeon territory, Intel should be very worried.
  • extide - Saturday, December 17, 2016 - link

    Did you even read the article? First of all, TSMC never had a 32nm generation, and no ARM chips ever made it to market on 32nm as far as I know. This was the first-gen Atom he was testing with, so the Atom was 32nm, but the ARM cores were 40nm, the best available at the time.
  • MrSpadge - Monday, December 19, 2016 - link

    Side note: Samsung had some Exynos chips on their 32nm process.
  • deltaFx2 - Saturday, December 17, 2016 - link

    "You'll note that frequency decreases in Intel processors are core count increases". Your observation is correct; your conclusion is entirely wrong; ARM, SPARC, Power etc will all face this bottleneck. To understand this, you have to first realize that CPUs ship with a TDP. Imagine a 4 core CPU at 95W TDP vs. an 8-core CPU at 95W TDP. The 8-core is a bigger die (more transistors), and is doing more work. Simple physics dictates that at the same voltage and frequency, the 8-core will burn more power. But, we've set the TDP at 95W (cooling solutions get more expensive), so the 8-core drops voltage and frequency to meet this TDP. Many memory-bound or I/O bound workloads do not scale with frequency, so the trade-off can be a net win for many applications. (It may be possible to overclock the 8-core to that of the 4-core with fancy cooling. Don't count on it though. It's possible that the 8-core is a binned part that leaks too much at high V, or is unstable at high f).

    As to the x86 vs ARM myth... the "efficiency" of ARM is a canard, at least as far as high-performance CPUs are concerned. Sure, x86 decode increases the die size and possibly burns a smidgen more power, but in a high-performance CPU the decoder is tiny compared to the size of the core itself and the massive caches. Most of the power is burnt in the out-of-order engine (scheduler, load-store), which isn't drastically different between x86 and ARM.

    Also recall that x86 displaced supposedly superior RISC processors in servers (SPARC, Itanium, Alpha, POWER, PA-RISC, etc.) at a time when the decoder was a larger fraction of the die. Also, if single-threaded performance didn't matter, AMD's Bulldozer and derivatives would be ruling the roost. The Bulldozer family's single-threaded performance is still higher than any ARM vendor's, and Excavator went a long way toward addressing the power. Yet AMD has less than 1% market share in servers. Even Sun's Niagara, which was in-order, was forced to move to OoO to address weak single-threaded performance. That ought to tell you something.
  • pkese - Sunday, December 18, 2016 - link

    I think you are right. The thing is that x86 instructions, being so complex to decode, forced Intel to store decoded instructions in a dedicated μop cache. That not only saves power (for fetching and decoding those instructions) but also skips some pipeline steps when the cache hits.
    ARM, on the other hand, having a simpler instruction set, can decode instructions on the fly, so it doesn't need a μop cache. But then it still needs to fetch and decode them every cycle. When you multiply that by ARM's somewhat worse code density (i.e. more bytes per ARM instruction compared to x86), you probably start wasting picojoules big time.
  • Wilco1 - Sunday, December 18, 2016 - link

    No, you're both wrong. A micro-op cache helps x86 power and performance, but it adds a lot of extra complexity and design time. And it will still be more power hungry than a RISC CPU that can decode instructions in just 1 cycle (see the Cortex-A73 article). It's also incorrect to suggest that the only differences are in the decoder - the ISA permeates the whole CPU: think partial flag and register writes, memory ordering, microcoding support, the stack engine, the extra load pipeline to support load+op, unaligned access/fetch, the better branch handling required if you don't have conditional execution or an efficient CSEL, etc.

    Note that code density of Thumb-2 is far better than x86, and density of AArch64 is far better than x64. The idea that CISC has smaller codesize is a myth just like "the ISA doesn't matter".
  • deltaFx2 - Sunday, December 18, 2016 - link

    @Wilco1: You have no idea what you're talking about. Let's start with the x86 ISA vs A64 (not Aarch32). The chief complexity with x86 decode is variable instruction length, not CISC. Most typical code paths use x86's simpler instructions (including ld-op) that do not crack into more than 2 ops.

    A73 is a wimpy 2-wide OoO core at a relatively low fmax. Nobody in their right mind would use that thing in a server. Apple's A* are a far better example as they are 6-wide. Their minimum mispredict penalty is 9-10 cycles (down from 14 earlier) and they max out at 2.3 GHz. Does that sound like one stage for decode? Hint: Complexity is a super-linear, often quadratic function of machine width.

    "density of AArch64 is far better than x64": I'm searching for a word that describes the excrement of a bovine, but it escapes me. Some third-party data please.You mean to say that an ISA that has constant 4-byte inst, and no ld-op/ld-op-store or complex ops has better code density than x86? The average instruction length of an x86 instruction is a little over 2 bytes for integer workloads. AVX* may be longer than 4-bytes but it does 256-bit/512-bit arithmetic; ARMV8 tops out at 128 bit.

    x86 has a CSEL instruction. It's called cmovcc, and it's been there since before AArch64 was a glint in ARM's eye. As have Power and every other ISA worth its salt.

    Stack engine: the stack engine pre-computes SP-relative pushes and pops so that you don't need to execute them in the scheduler. It's a power feature. AArch64 also has a stack, as does any CPU ever designed.

    Partial register writes: apart from the 8/16-bit legacy registers (AH, AL, AX), x86-64 zeroes out the upper bits of RAX when EAX is written (same as ARM's W0->X0). Nothing's partial. At any rate, partial writes only create a false dependency.

    Memory ordering: This is a minor quibble. It's a little more hardware complexity vs. expecting s/w to put DSBs in their code (which is expensive for performance. And people may go to town over it because debugging is a nightmare vs strong models). SPARC also uses TSO, so it's not just x86.

    Aarch64 supports unaligned loads/stores. There is no extra pipeline needed for ld-op in x86 if you crack it into uops. Where are you coming up with all this?

    Fun fact: Intel dispatches ld-op as 1 op with a max dispatch width of 4. So to get the same dispatch rate on ARM, you'd need to dispatch 8. Even Apple's A10 is only 6-wide.

    Aarch64 is simpler than x86 but it's not simple by any means. It has ld-op (ld2/ld3/ld4), op-store (st2/st3/st4), load-pair, store-pair, arithmetic with shifts, predicated arithmetic, etc. Just to name a few.

    Now on to Thumb/Aarch32:
    * Yeah, thumb code is dense, but thumb code is 32-bit. No server would use it.
    * Thumb is variable-length. See note above about variable length decode. A slightly simpler version of the x86 problem, but it exists.
    * Aarch32 has this fun instruction called push and pop that writes/reads the entire architected register file to/from the stack. So 1 instruction -> 15 loads. Clearly not a straight decode. There are more, but one example should suffice.
    * Aarch32 can predicate any instruction. Thumb makes it even fancier by using a prefix (an IT prefix) to predicate up to 4 instructions. Remember instruction prefixes in x86?
    *Aarch32 has partial flag writes.
    *Aarch32 Neon has partial register writes. S0,S1,S2,S3 = D0,D1 = Q0. In fact, A57 "deals" with this problem by stalling dispatch when it detects this case, and waits until the offending instruction retires.

    Plenty of inefficiencies in Aarch32, too many to go over. ARM server vendors (AMCC, Cavium) weren't even supporting it. Apple does, AFAIK.

    uopCache: This is a real overhead for x86. That said, big picture: In a server part, well over half the die is uncore. The area savings from the simpler ISA are tiny when you consider I$ footprint blowup. (Yes, it's real. No, you can't mix Aarch32 and A64 in the same program).

    ISA does not matter, and has not mattered. x86 has matched and surpassed the performance of entrenched RISC players. Intel's own foray into a better RISC ISA (Itanium) has been an abject failure.
  • name99 - Monday, December 19, 2016 - link

    (("density of AArch64 is far better than x64": I'm searching for a word that describes the excrement of a bovine, but it escapes me.))

    You do yourself no favors by talking this way. You make some good points but they are lost when you insist on nonsense like the above.

    https://people.eecs.berkeley.edu/~krste/papers/EEC...
    on page 62 is a table proving the point. If you bother to read the entire thesis, there is also a section describing WHY x86 is not as dense as you seem to think.

    Likewise, the issue with AArch64 (and with any instruction set in the 21st century) is not the silly 1980s things you are concerned about (arithmetic with shifts, load pair, and so on). The 1980s RISC concerns were as much as anything about transistor count, but no one cares about transistor counts these days. What matters today (and was of course a constant concern during the design of AArch64) is state that bleeds from one instruction to another, things like poorly chosen flags and "execution modes", and for the most part it doesn't have those. Of course, if you insist on being silly, there remains some of that stuff in the AArch32 portion of ARMv8, but that portion is obviously going to be tossed soon. (It has already been tossed by at least one ARM server architecture, will be tossed by Apple any day now, and eventually Android will follow.)

    Similarly, you don't seem to understand the issues with stacks. The point is that x86 uses a lot of PUSH and POP type instructions, and these are "naturally serializing" in that you can't "naturally" run two or more of them in a single cycle because they change implicit state. (Each one generates e.g. a store, an AGEN, and a change to the SP.)
    Modern compiled code (and that includes AArch64 code) does not use pushes and pops --- it keeps the SP fixed during a stack frame and reads/writes relative to that FIXED stack pointer. So there are no push/pop type instructions and no implicit changing of the SP as content is read from or written to the stack.
    Which means there is also no need for the hassle of a stack engine to deal with the backwardness of using push and pop to manipulate the stack.

    And to call Itanium a RISC ISA shows a lack of knowledge of RISC and/or Itanium that is breathtaking.
  • Kevin G - Monday, December 19, 2016 - link

    @name99

    Interesting choices of compiler flags used for testing in that paper. They weren't set for minimum binary size, which would have been a more interesting comparison point (-Os instead of -O2), nor for the most aggressive optimization (-O3). I get why they didn't choose -O3, since it can inline certain function calls and thus bloat size, but the other flags seem worth enabling to see what they do for code size. If anything, some of the vectorization could have increased binary size on x86, as the vector instructions tend to be on the larger side on x86.

    The other thing that stands out is the relatively large standard deviation on x86-64 code size. x86 binaries certainly can be smaller than ARMv7 or ARMv8 but there is no guarantee. x86 does beat out traditional MIPS and it would have been nice to see PowerPC included as another data point for comparison.

    Additionally, it would have been nice to see a comparison with icc instead of gcc for binary size. Granted, this changes the comparison by introducing a compiler variable, but I suspect that Intel's own compiler is capable of producing smaller code than gcc does.

    In closing, that paper has valuable real data for comparison but I wouldn't call it the last word on x86 code size.
  • name99 - Monday, December 19, 2016 - link

    When you've gone from "x86 is unequivocally denser" to "well, sometimes, under the right circumstances, x86 is denser", this is no longer a dispositive issue and so of no relevance to the conversation.
  • deltaFx2 - Tuesday, December 20, 2016 - link

    I owe Wilco an apology for the strong statement. However, the data you presented still does not corroborate his statement that "density of AArch64 is far better than x64". It appears to be in the same error margin. I already conceded in another post that the answer is "it depends". You can't make a blanket statement like that, though.

    And I was lazy in calling Itanium RISC. I meant designed on RISC principles (I know, it's a VLIW-like processor that goes beyond vanilla VLIW). Intel did do a RISC CPU too at some point that also failed (i960?).

    Legacy x86 probably uses a lot of push-pop. Plenty of modern x86 code does what you say (frame-pointer or SP relative loads). But fair enough, if you're not using Aarch32, you won't see a big performance benefit from it other than saving on the SP-relative Agen latency.

    I'm being silly by arguing that Aarch64 isn't as "simple" as people think? Load pair, store pair, loads/stores with pre or post indexing, the entire ld1/ld2/ld3/ld4 and their store variants, instructions that may modify flags, just to name a few? C'mon. RISC as a concept has been somewhat nebulous over the years, but I don't believe instructions with multiple destinations (some implicit in the pre/post indexing or flags case) were ever considered RISC. These are multi-op instructions that need a sequencer. BTW, I believe an ISA is a tool to solve a problem, not a religious exercise. ARM is perfectly justified in its ISA choices, but it's not RISC other than the fact that it is fixed instruction-length and load-store (for the most part). Mostly RISC, except when it's not, then?

    * ld2/ld3/ld4 etc. are limited gather instructions. ARM's canonical example is using ld3 to unpack interleaved RGB channels in, say, a .bmp into separate vectors of R, G, and B. st* does the opposite. (A sketch using the matching intrinsic follows below.)
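    A minimal sketch of that RGB example using the standard NEON intrinsic that maps to LD3 (arm_neon.h); the function name is made up and, for brevity, it assumes the pixel count is a multiple of 16:

        #include <arm_neon.h>
        #include <stdint.h>

        /* Deinterleave packed RGB bytes into separate R, G and B planes. */
        void split_rgb(const uint8_t *rgb, uint8_t *r, uint8_t *g, uint8_t *b, int pixels) {
            for (int i = 0; i + 16 <= pixels; i += 16) {
                uint8x16x3_t px = vld3q_u8(rgb + 3 * i); /* one LD3: 16 pixels, deinterleaved */
                vst1q_u8(r + i, px.val[0]);              /* 16 red bytes   */
                vst1q_u8(g + i, px.val[1]);              /* 16 green bytes */
                vst1q_u8(b + i, px.val[2]);              /* 16 blue bytes  */
            }
        }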
  • Wilco1 - Wednesday, December 21, 2016 - link

    Let me post some data for the GCC test in SPEC, built with the latest GCC 7.0 using -O3 -fno-tree-vectorize -fomit-frame-pointer. Of course it's only one data point, but it is fairly representative of a lot of integer (server) code. The x64 text+rodata size is 3853865; AArch64 is 3492616 (9.4% smaller). Interestingly, the pure text size is 2.7% smaller on x64, but the rodata is 92% larger - my guess is that AArch64 inlines more immediates and switch statements as code while x64 prefers using tables.

    Average x64 instruction size on GCC is 4.0 bytes (text size for integer code without vectorization). As for load+op, there are 101K loads, but only 5300 load+op instructions (i.e 5.2% of loads or 0.7% of all instructions). Half of those are CMP (majority with zero), the rest is mostly ADD and OR (so with a compare+branch instruction, load+op has practically no benefit). There are a total of 2520 cmov's in 733K instructions.

    As for RISC, the original principles were fixed instructions/simple decode (no microcode), large register file, load/store, few simple addressing modes, mostly orthogonal 2 read/ 1 write instructions, no unaligned accesses. At the time this enabled very fast pipelined designs with caches on a single chip. ARM used even smaller design and transistor budgets, focussing on a low cost CPU that maximized memory bandwidth (ARM2 was much faster than x86 and 68k designs at the time). I believe the original purist approach taken by MIPS, Sparc and Alpha was insane, and led to initial implementations poisoning the ISA with things like division step, register windows, no byte/halfword accesses, no unaligned accesses, no interlocks, delayed branches etc. Today being purist about RISC doesn't make sense at all.

    As for AArch64, load/store pair can be done as 2 accesses on simpler cores, so it's just a codesize gain, while faster cores will do a single memory access to increase bandwidth. Pre/post-indexing is generally for free on in-order cores while on OoO cores it helps decode and rename (getting 2 micro-ops from a single instruction). LD2/3/4 are indeed quite complex, but considered important enough to enable more vectorization.

    The key is that the complex instructions are not so complex they complicate things too much and slow down the CPU. In fact they enable implementations (both high-end as well as low-end) to gain various advantages (codesize, decode width, rename width, L1 bandwidth, power etc). And this is very much the real idea behind RISC - keeping things as simple as possible and only add complexity when there is a quantifiable benefit that outweighs the cost.
  • deltaFx2 - Thursday, December 22, 2016 - link

    Thanks for the ARM vs x86 data. I suppose a fairer comparison would be to compute dynamic instruction count (i.e. sum of (instruction size * execution frequency)), but it's probably best to leave it at that.

    The trouble with instructions that crack into multiple ops is that, before decode, you have no idea how many entries you need to hold the decoded uops. So you can't allow an instruction to expand into an arbitrary number of uops inline, because you may not have enough slots, plus you have to align the results of the parallel decodes (dispatch is in-order). Pure RISC with 1:1 decode is clearly simple. For ops that are not 1:1 you may need to break the decode packet and invoke a sequencer the next cycle to inject 2+ ops. Intel kinda does this with their 1 complex decoder + 3 simple decoders. ld2/ld3/ld4 can be a stream of ops that are pretty much microcode, even if you implement it as a hardware state machine instead of a ROM lookup table. The moment you have even one instruction that cracks into multiple uops, you need to build all the plumbing that is unavoidable in CISC, and which RISC was seeking to avoid. At this point, it's not an argument of principle but of degree. CISC has a lot of microcode; ARM has a little microcode (or equivalent).

    "keeping things as simple as possible and only add complexity when there is a quantifiable benefit that outweighs the cost" -> Well, that is the bedrock of computer architecture well before RISC, and it says nothing. Intel might argue that AVX-512 is worth the complexity. Fixed instruction length is a good idea. Relying on software to implement complexity is reasonable. Other than that, RISC designs have become more CISCy to a point where the distinction is blurred, and largely irrelevant. IBM Power has ucode but is fixed length. SPARC implements plenty of CISCy instructions/features.

    BTW, it's going to be very very inefficient to implement ld-pair as one op. I doubt anyone would put in the effort.
  • deltaFx2 - Thursday, December 22, 2016 - link

    The point being, ARM's ISA has more in common with x86 than Alpha (IMHO the cleanest true RISC ISA). ARM has carried forward some warts from A32. Not that there's anything wrong with it, but ARM decode itself has significant complexity. As noted earlier, x86's primary decode complexity comes from not knowing where the instructions are in the fetched packet of bytes (variable length). Sure, the extra instructions in x86 need validation (much of it legacy), but I don't believe it is a significant fraction of the overall verif effort on a (say) 20 core part, or a notebook part with GPU+accelerators+memory controllers+coherence fabric etc. Similarly, the power advantage of a simpler ISA is overstated given all of the above IP residing on the chip. If ARM wins in the data center, it will not be because it had a purer ISA, but because its ecosystem is superior (s/w and h/w). Like C/C++, html, etc, x86 survived and thrived not because it was pure and efficient, but because it got the job done (cheaper, better, faster).
  • azazel1024 - Tuesday, December 20, 2016 - link

    Not always. For example, at work we are looking to build some database servers running MS SQL; since it is licensed per 2 cores (as is a lot of server software these days), the fewer cores we run, the cheaper. A couple of dual-socket Xeon E7-8893 v4 servers are significantly cheaper to set up than a couple of single-socket E7-8867 v4 servers. Yes, the ultimate performance is a fair amount less, but it is something approaching 75% of the performance, and in exchange it ends up costing about $50,000 less per server on the software side of things.
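    A back-of-envelope sketch of that per-core licensing math; the core counts match the SKUs named above, but the per-2-core pack price is a made-up placeholder, not a real quote:

        #include <stdio.h>

        int main(void) {
            int cores_dual_8893v4   = 2 * 4;  /* two 4-core E7-8893 v4 sockets */
            int cores_single_8867v4 = 18;     /* one 18-core E7-8867 v4 socket */
            double pack_price = 14000.0;      /* hypothetical cost per 2-core license pack */

            printf(" 8 cores: %2d packs -> $%.0f\n", cores_dual_8893v4 / 2,
                   (cores_dual_8893v4 / 2) * pack_price);
            printf("18 cores: %2d packs -> $%.0f\n", cores_single_8867v4 / 2,
                   (cores_single_8867v4 / 2) * pack_price);
            return 0;
        }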
  • deltaFx2 - Saturday, December 17, 2016 - link

    With SMT on, a 32-core Skylake is the equivalent of 64 "cores" (no SMT on the Qualcomm cores). If a Skylake thread is just as powerful as a Qualcomm core, why would one switch? Also, there's AMD Naples/Zen due in mid-2017, also 32C/64T. To top it off, from the description above, QC appears to be a 1P system only, whereas the x86 systems will likely also support 2P (so up to 64C/128T per system).

    So really, the QC part has two competitors. You might argue that AMD and QC are the real competitors (Intel being deeply entrenched), but the barrier to switching to QC is higher. Unless QC has some fancy killer app/accelerators that neither x86 vendor provides. It will be interesting to see how it shapes up.
  • Antony Newman - Saturday, December 17, 2016 - link

    In H2 2017, we may find that Qualcomm's cores are 30% slower (IPC) than Apple's A10, and Apple is 15% lower in IPC than an Intel Xeon. A 48-core Qualcomm will, if it does not melt in its single socket, perform comparably to a 24-32 core Xeon where no special AVX 'hardware acceleration' is invoked.

    If, at that point, Intel does not reduce its prices and instead tries to maintain its ~70% profit margin, Qualcomm will - if the software ecosystem is sound - find acceptance in the server world.

    If Qualcomm adds hardware acceleration that can offload more computational work than Intel can, its 48-core chip will be received even more favourably, delegating to the ARM cores what they are most efficient at handling.

    When CCIX eventually matures, those 'bolt-on accelerators' are - in my opinion - going to drive their uptake in large-scale systems.

    At TSMC 10nm / Intel 14nm - Qualcomm will be able to get a foothold.

    When TSMC 7nm is available, Qualcomm will no doubt close the gap on architectural IPC and may be only 30% slower than Intel per CPU core - but by then they will have enough silicon area for a 64-core ARM chip (perhaps with SVE extensions), and Mellanox et al. ready to help them build accelerated offerings that target everything from desktops to hyperscaler systems.

    (Dreaming) Who knows - maybe Apple will use them in a future iMacPro? ;-)

    AJ
  • MrSpadge - Monday, December 19, 2016 - link

    I think it's going to compete with Xeon-D rather than Skylake-EP.
  • iwod - Friday, December 16, 2016 - link

    Let's hope it will offer decent single thread performance first. Otherwise I am much more looking forward to AMD Zen.
  • boeush - Saturday, December 17, 2016 - link

    Many years ago, at a point along the Sun's galactic orbit far, far away, there used to exist a company called Sun Microsystems, which tried to push the idea of giant chips full of a myriad of tiny, weak cores.

    That company no longer exists. One of the reasons: it turns out most software is not easily parallelizable, and people would rather run web sites (or other server workloads) with snappy response and an option to scale out through more hardware than be able to support more simultaneous clients out of the box with each client invariably experiencing ssssllllloooooowwwwww responses - no matter how much money you throw at the hardware...
  • patrickjp93 - Saturday, December 17, 2016 - link

    It got bought by Oracle, and btw, SPARC processors are still made by Oracle and Fujitsu and are in use in some workstations and supercomputers, and they have many "weak" cores.
  • Ariknowsbest - Saturday, December 17, 2016 - link

    Sun Microsystems is thriving under Oracle. And the latest SPARC chips have up to 256 threads at 20nm, perfect for business applications.
  • webdoctors - Sunday, December 18, 2016 - link

    Ya, I was going to add the same point. The Sun Niagara series of processors has been around for almost a decade now, and it's got tons of weak cores for running many web server threads.

    Also, as others have pointed out, the decoder logic is a small dynamic-power overhead compared to the static cache power and the out-of-order dynamic power, so ARM cores are not blatantly more power efficient than x86 in the high-performance realm.

    Anyways, more competition is great: AMD, IBM, Oracle and Intel will have another competitor, and consumers will have more choices. I haven't heard of anyone working on ARM cores for HPC and servers; maybe it'll happen, but it's definitely a pretty niche market at the moment.
  • FunBunny2 - Sunday, December 18, 2016 - link

    -- Also, as others have pointed out, the decoder logic is a small dynamic-power overhead compared to the static cache power and the out-of-order dynamic power, so ARM cores are not blatantly more power efficient than x86 in the high-performance realm.

    What folks are missing is the obvious: x86 is a CISC ISA which is "decoded" to something RISC-like (we don't, so far as I know, know the ISA of that real internal machine). The ARM "decoder" is along the lines of the old transparent machines like the IBM 360: the compiler (COBOL, mostly, in the case of the 360) takes the source and turns it into assembler, which the decoder turns into machine instructions one-for-one. That's what "decoder" meant for decades.

    Intel (and perhaps others before, I don't know) chose to use those billions and billions of transistors to build a RISC-like machine in the hardware, hidden behind a "decoder" which is really a hardware CISC-to-RISC JIT.

    The Intel way leads to simpler compilers, which need only spit out legacy x86 assembler (modulo additional instructions, still CISC, that have come along the way), while ARM compilers have to take that source and figure out how to get down to real hardware instructions. So, yes, the object file from an ARM compiler will be bigger and more complicated, but easier to handle at run time. The real issue is which runs faster in the wild: the eventual RISC-like code in the x86 machine (counting all that decoding/caching/etc.) or the native RISC code on ARM. I've no idea.
  • Wilco1 - Sunday, December 18, 2016 - link

    No, in all cases instructions are decoded by hardware into one or more internal micro-ops which are then executed (so there is no JIT compiler - Transmeta and NVidia Denver are rare exceptions).

    Both CPUs and compilers prefer simpler instructions since complex instructions are too slow and too difficult to optimize for (this is why RISC originally started). As a result all commonly used instructions are decoded into a single micro-op, both on RISC and CISC CPUs.

    The difference is in the complexity: a RISC decoder is significantly simpler and smaller than a CISC one. For example, instructions on x86 can be between 1 and 15 bytes long, and each instruction can have multiple prefixes, slightly modifying behaviour. You need a microcode engine for the more complex instructions even if they are very rarely used. On high-end cores you require a stack engine to optimize push/pop operations and a micro-op cache to reduce power and decode latency. All this adds a lot of transistors, power and complexity even on a high-end core.

    Writing a compiler backend for a RISC CPU is easier due to the simpler, more orthogonal instructions. RISC doesn't imply larger object sizes either: ARM and Thumb-2 are smaller than x86, and AArch64 is smaller than x64. Why? Unlike RISC, the encodings on x86/x64 are not streamlined at all. Due to all the extra prefixes and backwards compatibility, instructions are now over 4 bytes on average. x86 also requires more instructions in various cases due to having fewer registers, less powerful operations and branches.
  • FunBunny2 - Sunday, December 18, 2016 - link

    -- No, in all cases instructions are decoded by hardware into one or more internal micro-ops which are then executed (so there is no JIT compiler - Transmeta and NVidia Denver are rare exceptions).

    Tomayto, tomahto. If the "decoder" has to figure out what instruction stream to emit to the hardware, that's a compiler, JIT or otherwise. The mere fact that micro-op caches exist is confirmation that "JIT" is happening.
  • Wilco1 - Sunday, December 18, 2016 - link

    It's only called a JIT if the translation is done by software and involves optimization. If it is done by hardware with a fixed expansion for every instruction, then it's simply decoding. CPUs have translated their ISA into internal instructions since the early days.

    Also a micro-op cache has nothing to do with "JIT". Neither do pre-decode caches. Both are signs that decode is so complex that there is an advantage in caching the results. Not necessarily a good sign.
  • FunBunny2 - Sunday, December 18, 2016 - link

    -- CPUs have translated their ISA into internal instructions since the early days.

    Not to be too picky (OK, maybe), but in the "early days" decode meant taking the assembler output - machine code instructions - one at a time to set the hardware. No substitution or other stuff happened.

    The first exception that I know of came with the 360 machines. The top-of-the-line machines executed the ISA directly in hardware, while the /30 had hardware, legend has it, similar to a DEC PDP, driven by "microcode" equivalent to 360 assembler. The in-between machines had in-between implementations.
  • deltaFx2 - Sunday, December 18, 2016 - link

    "complex instructions are too slow and too difficult to optimize for": Huh? That was circa 1980. Use intel's icc compiler and look at the disassembly. It extensively uses complex addressing modes and ld-op/ld-op-store. In Spec, footprint is easily 20% smaller than gcc.

    ARM != RISC. See my earlier post, I'm not going over this again.

    "Why? Unlike RISC, the encodings on x86/x64 are not streamlined at all. Due to all the extra prefixes and backwards compatibility instructions are now over 4 bytes on average. x86 also requires more instructions in various cases due to having fewer registers, less powerful operations and branches."

    See what you did there? Claim: ARM binaries have a smaller I$ footprint. Reason: x86 has larger instructions. Even if we accept that x86 has larger instructions (not true), if each instruction encodes more information, this is amortized. In typical workloads, x86 instruction length averages between 2-3 bytes (non-vectorized/non-FP). Also remember that x86 allows you to encode 32-bit immediates in the instruction. On ARM, you either have to compute that value or load it; ARM's largest immediate is 16 bits or less. AVX is larger than 4 bytes, but it supports 256-bit operations. (The immediate point is sketched below.)
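    A minimal sketch of the immediate point; typical (but compiler-dependent) codegen folds the 32-bit constant into the x86 add's immediate field, while AArch64 usually materializes it first (e.g. mov+movk) and then adds:

        #include <stdint.h>

        /* The constant is wider than AArch64's arithmetic-immediate field,
           so it cannot be encoded directly in a single add there. */
        uint32_t add_big_constant(uint32_t x) {
            return x + 0x12345678u;
        }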

    x86 has fewer registers, but supports ld-op/op-store/ld-op-store. ARM needs a register to load to, before doing the EX. So you need to rename all these temps and keep them around just in case there's an exception. There's no free lunch, mate.

    Branches: that's just nonsense. If you're referring to predication, predication is usually a bad idea for an out-of-order CPU, because you have to execute ~2x the number of instructions. It only makes sense if the branches are hard to predict, and x86 has a cmov instruction for that (analogous to csel). ARM relied on predication in A32 as an easy way of avoiding both a fancy branch predictor and the pipeline stalls on a branch. That made sense in 1990, but not in the market ARM currently competes in.
  • Wilco1 - Sunday, December 18, 2016 - link

    Yes, it's a well-known and widely discussed problem that icc generates overly long, overly CISCy instructions that stall decode and require pre-rename expansion. Load-op-store is a bad idea on pretty much all x86 CPUs. Load-op helps in some cases, as modern cores keep them fused until after rename (and dispatch is now finally 6-wide); however, they are not frequently used in compiled code. I did some measurements a while back: load-op was around 5% of loads, and the majority of cases were test and cmp. Not exactly "extensive use".

    "ARM != RISC"??? So you have no idea what RISC means then. Oh well...

    Yes, it's a fact that ARM binaries are smaller than x86. x86 uses both larger and slightly more instructions. That includes non-vectorized integer code. Compile some binaries and educate yourself. I did. Note that the semantic content of the instructions is very close - this means on average each instruction encodes the same amount of information (despite some use of load+op or complex addressing modes). x86 just happens to use a less dense encoding.

    "So you need to rename all these temps and keep them around just in case there's an exception. There's no free lunch, mate."

    And you believe x86 doesn't need extra rename registers for all those temporaries and keep those around too??? No free lunch indeed.

    As for predication/CSEL, this is still used extensively on modern cores to improve performance, since there are still many hard-to-predict branches. Executing 1 or 2 extra instructions is far cheaper than a 10% chance of a branch mispredict. The only reason it isn't used much on x86 is that cmov sucks big time on most cores, so x86 compilers typically emit complex sequences of ALU instructions for functions like abs, conditional negate, etc...
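    A minimal sketch of the kind of selects being discussed; whether a compiler emits CSEL/cmov here or a branch depends on its cost model and the target, so this only illustrates the source pattern:

        /* Candidates for branchless code: a clamp and an abs (conditional negate). */
        int clamp_min(int x, int lo) {
            return (x < lo) ? lo : x;   /* may become cmp + csel / cmov */
        }

        int iabs(int x) {
            return (x < 0) ? -x : x;    /* may become a short ALU sequence or csel/cmov */
        }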
  • deltaFx2 - Sunday, December 18, 2016 - link

    RISC, when the idea was introduced in academia by Patterson et al., meant an ISA in which each instruction executes in a single cycle. Anything longer than that has to be handled in software, so no mul or div or FP instructions. Clearly that wasn't a great idea. So the idea morphed into: multi-cycle ops are OK as long as they don't crack into multiple ops (in other words, no complex state machine, a.k.a. microcode). Except that the push/pop instructions that A32 has require some sequencer (ucode or hardware is beside the point). As do loads with auto-increment and a host of other instructions in A32 and A64. So what is RISC, according to you? Fixed instruction length? ARM isn't that either (pure 64-bit is, of course). Load-store architecture? That's a pretty tenuous link to the original definition then.

    Re. CSEL/CMOV, compile hmmer (SPEC2006) using gcc -O2 and see what you get. I've seen cmov all over the place in x86. As to it sucking, there's no reason it should, but I don't know what Intel does. And the hard-to-predict branches are precisely the point: hmmer has a bunch of load->compare->branch sequences that are predicated this way.

    Re binary size, that's not my experience. gcc -O2 produced (circa early 2015) pretty large A64 binaries, slightly larger than the corresponding x86-64 (dynamic count). I suppose it also depends on your compiler switches. If x86 by default compiles for performance and ARM compiles for code footprint, that might explain it.

    Re. ld-op temps, usually you'll keep them until the ld-op retires. In ARM, the temp is held past retire because it's architected state. And you'll have to spill/fill them across function calls, etc. to maintain state, even if it's long dead. At any rate, my point was that x86's ISA allows for more registers than apparent just by looking at the architected state.

    Again, your experience of ld-op differs from mine. I've found it to be as high as 30% of loads. Perhaps there's something in your code that requires the loaded value to be kept around in architected state (like 2+ consumers). It's possible, idk.

    Not sure why you think ld-op-store is always a bad idea.

    Anyway, let's put the ARM vs x86 code size argument down as "it depends". Here is a paper online that suggests x86 is denser than or as dense as Thumb-2: http://web.eece.maine.edu/~vweaver/papers/iccd09/i... I'm sure a workload that makes heavy use of ARM's auto-incremented loads and doesn't allow x86 to use ld-op will yield the opposite result.
  • name99 - Monday, December 19, 2016 - link

    "RISC, when the idea was introduced in academia by Patterson et al. meant an ISA in which each instruction executes in a single cycle."

    Yes and no. You can read what Patterson et al said in their own words here:
    https://courses.cs.washington.edu/courses/cse548/0...

    IMHO the IMPORTANT part of this document is the first and second paragraphs.
    The complaints against CISC (and thus the point of RISC) are that CISC
    - has increased design time
    - increased design errors
    - inconsistent implementations
    - poor use of scarce resources (especially transistors).

    To that end RISC
    - execute one instruction per cycle. (NOT one-cycle instructions! That was infeasible even on the very first chip, which had a two-cycle load/store.) This is a pointless restriction that was soon tossed.
    - all instructions the same size. (This helps massively. It's what allows Apple to decode 6 instructions per cycle [and IBM to do even more], while Intel, to get results worse than that, needs a ton of helper machinery, grown every iteration and now consisting of essentially two parallel, independent instruction caches.)
    - load-store architecture. (This is also essential, in that Intel uses the same thing, through micro-ops)
    - only include instructions that can be exploited by high-level languages. (This has been a constantly moving target. It's still not clear to me if the ARM and ARMv8 2/3/4 vector transposition instructions are worth their hassle, and whether real compilers can use them without developers having to resort to assembly.)

    All in all I'd say the score-card 35 years on looks like:
    - same-sized instructions (or only two sizes a la ARM+Thumb) are immensely helpful

    - load-store is essential, so much so that you fake it if you have to

    - one instruction per cycle (or one-cycle instructions) is neither here nor there.

    - the one BIG omission from the paper is that you don't want instructions that modify various pieces of implicit state (implicit flags for carry, branching, or various sorts of exceptions; an implicit stack pointer)

    - IMHO memory model/memory ordering is a HUGE issue, but it's one of these things that's invisible to anyone outside one of the big companies, so we have no data on how much difficulty it adds.
    We do have ONE data point --- that Intel screwed up their first attempt at HW transactional memory, whereas IBM didn't [for both POWER and z-Series, which is a CISC that is mostly RISC-like, much more so than Intel]. But maybe IBM was just more careful --- we have to wait and see how the various ARM companies' attempts to implement the more sophisticated memory semantics of ARMv8.1 and v8.2 [and eventually transactional memory] go.

    - short-term decisions to save a few transistors are stupid no matter who does them. Intel did this all over the place (and not just at the start *cough* MMX *cough*). MIPS did it with delayed branch slots. ARMv8 seems to have done remarkably well in this respect, with nothing that I can think of that screams "stupid short-sighted idea". (Some would say the effective limitation to 56 address bits is such. We shall see. Perhaps five years from now we'll all be using the new Vector extensions, and at that point Neon will seem like a bad idea that the designers all wish they could strip from the ISA?)

    "In ARM, the temp is held past retire because it's architected state. And you'll have to spill/fill them across function calls, etc. to maintain state, even if it's long dead. "
    Not exactly true. Yes, the HW has to commit these temporary register writes, but commit is off the critical path.
    But in most cases the calling conventions define volatile and non-volatile registers so, no, you do NOT have to spill and fill registers that are just temporaries, not unless you're using unusually large numbers of them in unusual ways.

    Oh, one more thing, what's with the constant reference to 32-bit ARM? No-one in the server space (or in the "anything larger than a watch" space --- and let's see how long it is till the Apple Watch switches to ARMv8; my guess is by 2020 at the latest) gives a damn about 32-bit ARM.
    Argue about relevant design decisions, not about choices that were a big deal back in 2003!
  • deltaFx2 - Tuesday, December 20, 2016 - link

    Constant reference to 32-bit ARM: Well, Wilco brought it up while arguing code density. I'm assuming that Thumb wouldn't be less dense than AArch64; the referenced paper didn't have AArch64 (circa 2009), so you have to read between the lines. I know most server players don't support AArch32, and Apple is phasing it out. (I think they got rid of it in their ecosystem, but apps may be A32.)

    Re. x86 being ld-store: That's a stretch. How else would you build an out-of-order pipeline? You could build a pipeline that handles fused ld-ex in the same pipeline, but that would be stupid because there are plenty of combinations of EX. Splitting it across two pipes is the logical thing to do. In an alternative universe where no RISC existed, do you really believe this would not be how an OoO x86 pipeline is designed?

    I think I stated earlier that x86's biggest bottleneck is the variable length decode, not "CISC".

    Spill-fill of dead regs: It would be nice if compilers could do analysis across multiple function calls. Some do better than others, but it's not uncommon for compilers to be conservative or simply unable to do inter-procedural optimizations beyond a point. The other thing is, the extra 16 regs are architected state, so you need dedicated storage in the PRF for them, per thread. An x86 SMT-2 machine needs 16*2 PRF entries for dedicated architected state; an ARM SMT-2 machine needs 32*2. Intel in Haswell, I think, had a roughly 172-entry PRF, so 140 entries for speculative state. In ARM, it would be 108 entries. ARM pays this tax even when I can make do with 8 architected registers. Since the x86 temp has only one consumer, one can reuse the destination of the ld-op as the temp.
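
    Spelling out the arithmetic above in a throwaway sketch (the 172-entry figure and the SMT-2 assumption are taken from this comment, not from a datasheet):

        #include <stdio.h>

        /* Back-of-the-envelope PRF accounting: architected state is reserved
         * per thread, and whatever is left over is available for speculation.
         */
        int main(void)
        {
            const int prf_entries = 172; /* rough Haswell-class integer PRF size  */
            const int threads     = 2;   /* SMT-2                                 */
            const int x86_regs    = 16;  /* x86-64 architected integer registers  */
            const int arm_regs    = 32;  /* AArch64 architected integer registers */

            printf("x86 speculative entries: %d\n", prf_entries - threads * x86_regs); /* 140 */
            printf("ARM speculative entries: %d\n", prf_entries - threads * arm_regs); /* 108 */
            return 0;
        }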

    Consistency model: That's an interesting one. Do you burden the programmer with the responsibility of inserting serializing barriers in their code (DSB/ISB, etc. in ARM), or do you burden the hardware designers? There are far more programmers than hardware designers, after all, and a release-consistency-like memory model is likely to be painful to debug. I have heard of (apocryphal) instances in which programmers go to town inserting barriers because of "random" ordering. Note also that in x86 and other strong memory models, the default speculation is that there is no ordering violation (this is in the core itself; in the uncore one has to obey it). If an ordering violation is found after the fact (like an older load saw a younger store and a younger load saw an older store), then the core flushes and retries. As opposed to a DSB, which has to be obeyed all the time. IDK. In the common case, I would think speculation is more effective. That's the whole point of TSX, right? It's hard to compare though.
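
    A minimal C11 sketch of where the burden lands (the names are illustrative): the programmer states the ordering once, and what it costs depends on the ISA - on x86-TSO the release store and acquire load compile to plain mov instructions, while on AArch64 they typically become stlr/ldar (or explicit barriers on older targets).

        #include <stdatomic.h>
        #include <stdbool.h>

        /* Classic message-passing handoff expressed with C11 atomics. */
        static int payload;
        static atomic_bool ready;   /* static storage: zero-initialised to false */

        void producer(int value)
        {
            payload = value;                                            /* plain store */
            atomic_store_explicit(&ready, true, memory_order_release);  /* publish     */
        }

        bool consumer(int *out)
        {
            if (atomic_load_explicit(&ready, memory_order_acquire)) {   /* observe flag */
                *out = payload;   /* ordered after the flag load by acquire/release */
                return true;
            }
            return false;
        }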

    Just one final note on decode width: it only really matters after a flush, or in very high-IPC workloads. IBM needs a wide decode (8) because it's SMT-8. Apple has 6-wide decode @ 2.3GHz fmax. They would probably have needed more decode stages if they had targeted 4GHz (fewer than x86, for sure). The op-cache in Intel is as much a power feature as it is a performance feature. It allows 6-wide dispatch, shuts down decode, and also cuts the mispredict latency. You could build a 6-wide straight decode in x86 at more power, or have this op-cache. My guess is that the op-cache won, and I would guess it was due to variable length. Would a high-frequency, high-performance ARM design benefit from an op-cache? Not as much as x86, but I'm sure a couple of cycles less mispredict latency alone should make it interesting.
  • Meteor2 - Tuesday, December 20, 2016 - link

    Who says Anandtech is going downhill? Best discussion I've seen in years. Would love a conclusion though...

    Mine is that ISA makes no significant difference to power/performance. It's all about the process now. And with EUV looking like the end of the road, at least for conventional silicon, I think everything will converge at about '5' nm in the early 2020s.

    In which case it's probably not worth investing in a software ecosystem different than x86.
  • deltaFx2 - Tuesday, December 20, 2016 - link

    @Meteor2: I agree. ARM has to bring something awesome to the table that x86 does not have. x86 has decades of debugged and optimized software (including system software), a similar cost structure to ARM (arguably better, since the cores also sell in the PC market, as opposed to ARM server makers), and higher single-threaded and multithreaded performance (at the moment). With AMD's future products looking promising (we'll know for sure next year), there's also the competition aspect that ARM vendors keep harping about. But let's see. Should be interesting. Fujitsu has announced that they will use ARM, with vector extensions, in their HPC servers, so we'll see.
  • deltaFx2 - Sunday, December 18, 2016 - link

    "what folks are missing is the obvious: X86 is a CISC ISA which is "decoded" to RISC (we don't, so far as I know, know the ISA of this real machine)." Before RISC became the buzzword, this was called microcode. It's always been around, just that RISC started killing old CISC machines like VAX by promising smaller cores and higher frequencies (both true until out-of-order became mainstream and it mattered much less). Intel's marketing response was to say that we're RISC underneath, but honestly, how else would you implement an instruction like rep movsb (memcpy in 1 instruction, essentially) in an out-of-order machine?
  • Threska - Tuesday, January 3, 2017 - link

    With virtualization I could see it being viable.
  • evilspoons - Saturday, December 17, 2016 - link

    Anandtech hits you with Wall of Text! It's super effective!

    Euughh, could use a couple more paragraphs, maybe some bullet points between those first two images, this is making me cross-eyed.
  • cocochanel - Saturday, December 17, 2016 - link

    One important fact is being overlooked by many posting comments. Qualcomm is a big company, and their market analysis showed there is a market for this product. A company of this size would not spend big bucks on a new server architecture just because they have nothing else to do. The x86 vs ARM debate has been around for a while, with both camps digging in rather hard. Only the future will decide the winner. On a side note, ARM's efficiency is a big advantage, plus the ability to scale, not to mention licensing advantages. As hard as Intel and AMD try, it is hard to squeeze more and more from the old x86. I mean, look at AMD. They spent a fortune and 4-5 years on Zen (or Ryzen), and what are the results? A processor that is not that much better than Intel's current lineup.
  • deltaFx2 - Sunday, December 18, 2016 - link

    @ cocochanel: Re the Zen comparison, that comparison would make sense if ARM (or Power) wiped the floor with x86 in performance. Clearly they do not. Power burns a ton of power to be competitive with x86 in multithreaded performance (1T is still behind), and ARM isn't in the ballpark.
  • deltaFx2 - Sunday, December 18, 2016 - link

    One more thing to add: QC is feeling pressured by low-cost providers in the cellphone space, and it needs to move out. As servers are a high-margin business, it makes sense for QC to try. Anand Chandrasekher, who heads this project at QC, said the market wants choice. The question is, does it want choice in providers or choice in ISA? In other words, is the existence of Zen sufficient to provide an alternative to Intel, obviating the need for ARM? After all, there are other ISAs around. Power is here; it has high performance; it's not IBM's first rodeo. Power probably has a software stack ready to go. Where's Power in the data center?
  • smilingcrow - Sunday, December 18, 2016 - link

    Large companies looking to move into another sector is a Riscy business. ;)
    Sure, they'll have analysed it, and part of their desire to change tack is defensive, so they will take a risk as a punt on survival as well as in the hope of expanding.
    So it may be just as much a defensive move as anything, as diversification often is: don't have all your eggs in one basket.
  • TheinsanegamerN - Monday, December 19, 2016 - link

    Performance is also very, very important in the server space. It doesn't matter if you draw very little power if you only perform like a Pentium 4.

    And from what has been seen, ARM has yet to deliver on the performance part. If ARM could, we'd be seeing more than just a single experimental Qualcomm chip. Other companies would at least be trying.
  • FunBunny2 - Monday, December 19, 2016 - link

    -- And from what has been seen, ARM has yet to deliver on the performance part. If ARM could, we'd be seeing more than just a single experimental Qualcomm chip.

    Considering that C (or C++) is the universal assembler, modulo GUI libraries, any machine can be made to do what any other machine can do with applications written in it. Back in the old days of IBM owning most of the compute business with their mainframes, they would regularly audit their customers' applications to see which instructions were most used and what could be done to optimize them. I suppose Intel does the same, although I don't remember seeing any report of the results. Anyone? In particular, for a server machine, how heavily are the instructions that have been ALU staples for decades actually used? 99% of instructions? 30%? And so on.
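
    A rough way to answer that yourself: pipe a disassembly through a small tally program and look at the mnemonic mix. This is only a sketch - it counts static occurrences rather than executed instructions, and the parsing assumes GNU objdump's default layout (address, tab, opcode bytes, tab, mnemonic) - but it gives a first-order answer.

        #include <ctype.h>
        #include <stdio.h>
        #include <string.h>

        /* insn_mix.c - tally static instruction mnemonics from a disassembly,
         * e.g.:  objdump -d /usr/bin/someserverbinary | ./insn_mix
         */
        #define MAX_MNEMONICS 1024

        static char names[MAX_MNEMONICS][16];
        static long counts[MAX_MNEMONICS];
        static int  nmnem;

        static void tally(const char *mnem)
        {
            for (int i = 0; i < nmnem; i++)
                if (strcmp(names[i], mnem) == 0) { counts[i]++; return; }
            if (nmnem < MAX_MNEMONICS) {
                strncpy(names[nmnem], mnem, sizeof names[0] - 1);
                counts[nmnem++] = 1;
            }
        }

        int main(void)
        {
            char line[512], mnem[16];
            long total = 0;

            while (fgets(line, sizeof line, stdin)) {
                char *tab = strrchr(line, '\t');   /* mnemonic follows the last tab */
                if (!tab || !strchr(line, ':'))
                    continue;                      /* not an instruction line       */
                if (sscanf(tab + 1, "%15s", mnem) != 1 ||
                    !isalpha((unsigned char)mnem[0]))
                    continue;                      /* skip opcode-byte continuation lines */
                tally(mnem);
                total++;
            }

            for (int i = 0; i < nmnem; i++)
                printf("%8ld  %5.1f%%  %s\n", counts[i],
                       total ? 100.0 * counts[i] / total : 0.0, names[i]);
            return 0;
        }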
  • name99 - Monday, December 19, 2016 - link

    WTF are you talking about?
    Broadcom, Applied Micro, Cavium, and Phytium have all announced ARMv8 "server" class cores. Most of these have been what I call bring-up cores, announced with no expectation of sales; they're just there to allow the ecosystem to start.
    The 2017 offerings from these companies should be learning cores --- good enough to be competitive in some markets, but still primarily targeted at helping the big companies, the ecosystem, and the designers to understand what they are doing well and what needs to be done better.
    The 2019 crop of cores are the ones expected to start making serious inroads. (ARM PLC expects ~20% of server sales in 2020, but only very limited takeup before then).

    Those are the announced parts. ARM PLC may be designing its own part that's not yet public. Apple is IMHO already designing desktop and server parts for their own use. There's probably at least one other Chinese company, in addition to Phytium, designing a server or HPC part. And we know Fujitsu has HPC ARM plans, though who knows if they will sell those chips to third parties.
  • deltaFx2 - Tuesday, December 20, 2016 - link

    Broadcom and, I believe, AMCC have dropped out of the ARM server business. Cavium doesn't have the high-performance part nailed yet, but rumors on the internets suggest that Cavium acquired Broadcom's server effort. So then there were two in the US, Qualcomm and Cavium, plus Phytium and others in China.

    "The 2019 crop of cores are the ones expected to start making serious inroads. (ARM PLC expects ~20% of server sales in 2020, but only very limited takeup before then)." In 2012, 2016 was supposed to be the year of ARM servers with 10% sales (can't remember the source). THe unanswered question is, why? Why cut your nose to spite your face? Not everyone loves Intel, but what's the reason to switch unless ARM surpasses intel by a significant enough margin for it to matter? And all this assumes Intel is sitting on its backside doing nothing.

    It's a serious question... what's behind all the exuberance? One could license MIPS, or SPARC, in the past. Why is this time different?
  • FriendlyUser - Saturday, December 17, 2016 - link

    How would this match against AMD Naples with 32 cores and 64 threads? It should be available soon.
  • daniel1926 - Sunday, December 18, 2016 - link

    This chip will be a nightmare for software that is licensed on a per-core basis. Strong ST performance will remain king for that type of SW.
  • Meteor2 - Tuesday, December 20, 2016 - link

    This thing looks to be more in the vein of/a competitor to Xeon Phi than anything. It's beyond Xeon-D.

    Lots of parallel wimpy cores, but not as hard to utilise as a GPGPU.
  • twotwotwo - Saturday, December 24, 2016 - link

    So, I know no one will read the comments now, but some speculation:

    - This has to win on something other than just price. Fab costs alone keep it from being too cheap, plus Qualcomm wants its R&D money back, Intel can afford to cut prices to compete, and the cost of the other components of a server/datacenter means a cheaper CPU isn't revolutionary if it has nothing else going for it.

    - I think the real competitor might be the Xeon D--vs. other Intel server chips it's slower (but still Intel big cores, so not *that* slow) and cheaper, and it has a NIC onboard. Density is a big deal, and you can see how small Facebook got its Xeon D nodes here: https://code.facebook.com/posts/1711485769063510/f...

    - I think a large cloud corp could find a niche where an ARM chip would be at least *acceptable*. Some loads want cores as fast as you can practically get them; some don't. Right now, e.g. Google deals with this by loading up a bunch of "balanced" boxes with a variety of tasks that each need different things (e.g. RAM, CPU, disk (space and IOPS), network), but I bet they can find a place for specialized boxes with ARM cores if they want. The speed has to be "good enough," but something slower than Intel big cores might meet that bar.

    - I imagine that Intel will often be ready to make a counter-offer, and, hey, their stuff is good and they're the incumbent. I suspect that to go for Qualcomm's chips, a cloud company has to be playing a long game, trying to build an alternative to Intel at least in some niche. (I'd also expect them to brag about the power savings and environmental friendliness of low-power cores, but somewhat cynically I'm not sure that will really be much of a factor for them while they shop.)

    - As the post says, I wonder which components, like the NIC, Qualcomm brought onto the SoC. I also see everyone bragging about how great their future process nodes will be; it's hard to guess much, but I wonder whether there might be closer competition on the process side in the future.

    - Anyway, it's a moderately big deal if we see a big US cloud company talk about their use of something like this in production, and a very big deal if, you know, AWS/GCP/Azure come out with an ARM server instance type and it's legitimately interesting to use.

    ARM servers have been *just about* to become a thing for a long time, so I don't even pretend to have a prediction anymore, but if something does get going in a chip generation or two it'll be neat to watch.
