
  • tipoo - Sunday, August 20, 2017 - link

    Big ARM server CPUs will be interesting. The ISA is very sane and scalable; if the investment and demand were there, it would have no issue getting to where large x86 cores are. The ISA was never the limit.

    Then we can see if they can actually exceed them.
  • Kevin G - Sunday, August 20, 2017 - link

    This makes me wish that Apple would license their cores to 3rd parties. Recent Apple cores are getting very close to where x86 lies per clock and they've certainly exceeded x86 in performance/watt in the ultra mobile space (granted Intel's last round of ultra mobile chips was flat out cancelled, skewing such a comparison).

    Considering Apple's work in ultra mobile, I find it believable that a higher performance per clock design in the server space is feasible for an ARM design. A company with enough resources just needs to do it.
  • iwod - Sunday, August 20, 2017 - link

    If the leaked numbers for the A11 are true, then Apple may have exceeded Intel x86 in performance per clock as well.

    While Apple is highly unlikely to ever license its cores out, I wish they would use those cores and bring the Xserve server back.
  • peevee - Monday, August 21, 2017 - link

    Xserve died because of its own OS. Nobody is interested in anything but Linux (and sometimes a little Windows).
    They could have sold it with Linux, though.
  • Dr. Swag - Sunday, August 20, 2017 - link

    Apple never will though, since it's Apple we're talking about. They keep their tech to themselves to give themselves the advantage.
  • name99 - Sunday, August 20, 2017 - link

    The only benchmarks that exist are Geekbench 4 and the browser benchmarks against Apple laptop hardware. By THOSE benchmarks, the A9X matched Intel in IPC and the A10X exceeds it by around 15%.

    This is clearly an area that draws out the crazies in full screaming mode because a lot of assumptions have to be made (for example, the most realistic assumption is that the high-end Intel scores occur at the maximum turbo frequency, but the crazies will insist that, no, you have to normalize to the baseline Intel frequency for that particular CPU). Or you get the insistence that the ONLY measurement that matters is SPEC2006 compiled with icc, which runs into the issue that icc has MASSIVE effects on SPEC scores, and that no SPEC numbers in any form exist for the A10/A10X.

    At the end of the day, it boils down to "what is your goal?" If your goal is an honest comparison of the two processor families, the best data available suggests the summary I gave. If your goal is "my CPU can beat up your CPU" then all the data in the world presumably won't change your mind, and the best data of all is non-existent data (like the certain claims as to how the A10X would or would not behave on SPEC2006).
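
    To make the normalization fight concrete, here's a quick sketch with entirely made-up numbers; whether you divide an Intel score by the base clock or by the turbo clock it was actually achieved at swings the perf-per-clock comparison substantially:

        #include <stdio.h>

        /* Toy illustration of the frequency-normalization argument.
         * Every score and clock below is a hypothetical placeholder. */
        int main(void) {
            double intel_score = 4500.0;                  /* made-up single-core score */
            double base_ghz = 2.9, turbo_ghz = 3.5;
            double apple_score = 3900.0, apple_ghz = 2.3; /* made-up mobile-chip numbers */

            printf("Intel perf/GHz at base:  %.0f\n", intel_score / base_ghz);
            printf("Intel perf/GHz at turbo: %.0f\n", intel_score / turbo_ghz);
            printf("Apple perf/GHz:          %.0f\n", apple_score / apple_ghz);
            return 0;
        }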

    Final point. It is not at all implausible, IMHO, that Apple have a plan, and have already started proceeding down it, for ARM in their data centers. After all, why not? It saves them money, it allows them to run at their pace not Intel's (eg install AI or compression or encryption accelerators as they need them) and provides better security (both security through obscurity and not having as large an attack surface as Intel).
    But why would they talk about it? Apple says nothing ever, unless they have to. No way they would advertise to their competitors the extent to which they have comparative advantage through use of their own data warehouse chips (for at least some purposes).
  • zodiacfml - Monday, August 21, 2017 - link

    Not so fast. Apple's SoCs are huge in die size, which is the reason for their performance. They are as big as or bigger than Intel's Core chips. The best parts for comparison are the Core M parts. There is little or no business case for Apple to do this. There are rumors of using an Apple SoC in a MacBook Air, but that makes little sense, as they would need to port OSX to ARM. Again, that is not a good idea, as the MacBook Pro and the Mac Pro will continue with OSX.
  • cdillon - Monday, August 21, 2017 - link

    Apple has already ported OSX to ARM, and they call it "iOS". It's not going to be as big a deal as you think to get OSX as we know it running on ARM. Not only that, but they already have experience juggling two processor architectures (PPC and x86) at the same time in one OS.
  • extide - Monday, August 21, 2017 - link

    And 68k to PPC, back in the day
  • name99 - Monday, August 21, 2017 - link

    Apple's SoCs are not huge, and neither are their cores.
    The iPhone SoCs tend to hover around 100 to 120mm^2; the iPad SoCs sometimes reach 150, though the A10X is below 100.
    The cores are a few mm^2 each. Eyeballing it, I'd say the entire CPU complex (two large cores, two small cores, and L2) is about 12mm^2. That is substantially larger than ARM's cores (four A73s plus their L2 in the same process technology would fit in 8mm^2) but substantially smaller than Intel's (an Intel core these days runs around 8mm^2 in Intel's 14nm).
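    Back-of-envelope, using those eyeballed numbers (a sanity check only; the inputs are my estimates above, not measured die shots):

        #include <stdio.h>

        /* Compare each CPU complex (cores + L2) against a single Intel core,
         * using the rough areas estimated above. */
        int main(void) {
            double apple_complex = 12.0; /* 2 big + 2 small cores + L2, mm^2 */
            double a73_quad      = 8.0;  /* four A73s + L2, same process, mm^2 */
            double intel_core    = 8.0;  /* one Intel core, 14nm, mm^2 */

            printf("Apple complex is %.1fx the quad-A73 cluster\n", apple_complex / a73_quad);
            printf("Apple gets 4 cores in %.1fx one Intel core's area\n", apple_complex / intel_core);
            return 0;
        }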
  • SarahKerrigan - Sunday, August 20, 2017 - link

    "Cavium is the most notable public player using ARM designs in commercial systems so far (there are a number of non-public players focusing on niche scenarios, or whom have little exposure outside of China). The latest design, the Cavium ThunderX2, uses the main A-series core licenses and interconnect license from ARM to provide large numbers of mobile-class CPU cores with as much memory bandwidth and IO as possible."

    This is not even remotely true. Neither Cavium's cores nor Cavium's interconnect (CCPI predates Cavium's jump to ARM) are ARM IP - they're using an architectural license, *not* IP blocks (or at least, not those ones.) ThunderX uses custom Cavium cores that are between A53 and A57 in performance, while ThunderX2 uses a small number of cores (32) based on the XLP/Vulcan design they bought from Broadcom.

    To make that last part more confusing, Cavium initially announced a *different* ThunderX2, which was an enhanced (54-core) derivative of the original ThunderX design. This seems to have been killed when the Vulcan uarch was licensed, or at least has not been heard from since.
  • Ian Cutress - Sunday, August 20, 2017 - link

    That's my fault; I wrote this while flying and thought I knew what was under the hood in ThunderX. Johan actually did a good write-up on this, and I'll edit the piece here appropriately.

    http://www.anandtech.com/show/10353/investigating-...
  • SarahKerrigan - Sunday, August 20, 2017 - link

    "uses the architecture licence for the main A-series core from ARM"

    That makes even less sense. A-series cores don't factor into it. ThunderX is custom.
  • name99 - Sunday, August 20, 2017 - link

    Is this public knowledge (original ThunderX2 killed, new ThunderX2 based on Vulcan)?
    I know it's public that (beginning of this year) Cavium acquired Vulcan IP, but I'd not heard anything beyond that. ThunderX2 is supposed to ship Q3 this year (ie RSN...) which to me suggests they're too far along to drop it, and Vulcan will be the basis of ThunderX3.
  • SarahKerrigan - Sunday, August 20, 2017 - link

    Yes. There have been a number of commits to LLVM, etc., indicating that ThunderX2 is now Vulcan. Cf. the ThunderX2 LLVM model, which straight-up says "Based on Broadcom Vulcan."

    I don't know whether the original TX2 design is fully dead or merely mostly dead, but it's pretty obvious at this point that a Vulcan-based TX2 is coming.
  • SigismundBlack - Sunday, August 20, 2017 - link

    Thanks for the info.

    Denverton rather than 'Denveron'.

    Since the Atom C3000 series is cited here, it also seems worth mentioning AMD's low-power server SoCs (e.g. the X3421), which likewise feature in recent Moonshot systems and home/SOHO servers.
  • jameskatt - Sunday, August 20, 2017 - link

    The biggest problem I see is whether Qualcomm is going to be devoting resources to this project for the long term. Businesses require stability, predictability, and long-term support. Qualcomm's competitors have been in the business for decades and will be in the business for decades. Qualcomm can't prove they will be in the business for decades to come, particularly if they make no money on it.
  • Kevin G - Sunday, August 20, 2017 - link

    Qualcomm has been around for a while, so there is stability there. They are new to the ARM server market, though, because, well, after many false starts this market appears to finally be emerging. Even though Qualcomm is just launching this chip, it would be beneficial for them to discuss a roadmap to bring some long-term stability to the scene.
  • Wardrive86 - Sunday, August 20, 2017 - link

    Surely Qualcomm is using SVE and not regular NEON units? I wish they would expose how wide the units are. I'm very excited they were so open about their architecture. Great write-up as well, Ian!
  • Dmcq - Sunday, August 20, 2017 - link

    I doubt it. SVE is a biggie and was only announced recently; I can't see Qualcomm risking trying to put it in their first server chip.
  • SarahKerrigan - Sunday, August 20, 2017 - link

    I seriously doubt SVE is present. As far as I know, Fujitsu is still lined up to be the first SVE user, and it's not like ultra-wide vectors are a massive boost to conventional enterprise servers.
  • Kevin G - Sunday, August 20, 2017 - link

    Using SVE requires ARM v8.2A support, which this does not appear to have. The ARM v8.2A spec was only announced in January 2016, which isn't enough time to get it implemented into anything that'd be shipping now. Qualcomm could have been working behind the scenes, but that would have given them perhaps another year with a spec that could change before formal publication (i.e. it may have required some last-minute changes right as the design was taping out). For a server part, that path would be unwise.

    SVE was announced a year ago and is far more complex than the v8.2A release due to how it handles execution width. An SVE design right now is a virtual impossibility.
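
    For a sense of why SVE is such a different beast from NEON, here is roughly what vector-length-agnostic code looks like using the SVE C intrinsics ARM later published (a sketch, not anything from this chip): the loop never hard-codes a vector width, so the same binary runs on 128-bit or 2048-bit hardware.

        #include <arm_sve.h>
        #include <stdint.h>

        /* Vector-length-agnostic array add. svcntw() asks the hardware how many
         * 32-bit lanes it has; the predicate from svwhilelt masks off the tail,
         * so no scalar cleanup loop is needed. */
        void add_arrays(float *dst, const float *a, const float *b, int64_t n) {
            for (int64_t i = 0; i < n; i += svcntw()) {
                svbool_t pg = svwhilelt_b32_s64(i, n);  /* lanes still in range */
                svfloat32_t va = svld1_f32(pg, &a[i]);  /* predicated loads */
                svfloat32_t vb = svld1_f32(pg, &b[i]);
                svst1_f32(pg, &dst[i], svadd_f32_x(pg, va, vb));
            }
        }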
  • Hurr Durr - Sunday, August 20, 2017 - link

    I'd rather read something on the x86-on-ARM thing that MS and Qualcomm have than this. Much more potential for the real world.
  • Kevin G - Sunday, August 20, 2017 - link

    Can we get an editor in here?

    "For SoC design followers, one might look at this design and think they see similarities with designs such as AMD’s original Bulldozer design from 2011. ... Actually, after writing that last sentence, it is basically a Xeon Phi dual core module."

    While perfectly readable, that last paragraph could use a bit of a rewrite due to the last sentence nullifying it. My quick stab at a rewrite:

    For SoC followers, the Qualcomm pairs two modules per fabric stop similar to what Intel has implemented in their most recent Xeon Phi chips. Unlike the new grid topology in the Xeon Phi, Qualcomm is using a ring bus akin to what Intel uses on its Xeon E5 and E7 chips. Those thinking that a dual core module would follow AMD's Bulldozer philosophy will be disappointed to learn that no execution resources are shared between the cores, just the L2 cache, power management and bus interface.

    The same comparisons and ideas are made but they flow to the reader a bit more logically to me.
  • FunBunny2 - Sunday, August 20, 2017 - link

    -- Unlike the new grid topology in the Xeon Phi, Qualcomm is using a ring bus akin to what Intel uses on its Xeon E5 and E7 chips.

    I've long wondered how hardware engineers:
    1) discover such alternatives
    2) decide which one to choose

    Is this down to fundamental math and physics, or trial and error? Anyone know a readable (for the non-physics major, that is) source?
  • Kevin G - Monday, August 21, 2017 - link

    Topology is a well-studied concept. At a high level, this closely mimics general networking design. The choice of on-die topology is generally at the mercy of engineering trade-offs that are unique to this context.

    With a ring bus you get an easy means of scaling the number of units, but the trade-off is an increase in latency around the ring as the numbers go up. Diminishing returns are hit as the count increases. With a ring, though, individual units on the ring can be radically different sizes on a die, as long as the links between stops can be roughly the same for timing purposes. A ring bus also permits a relatively predictable latency to reach stops further away, something noteworthy for implementing coherency protocols. Another trade-off with the ring design is that it'll always consume power: nodes that are not in use still need to keep their ring stop running to permit data to pass through.

    Qualcomm sidesteps the ring issue a little bit by including two cores per ring stop, thus putting the minimum number of stops at 24. Just like Intel, I suspect on-die IO like PCIe, memory controllers, etc. will have their own ring stops. It is not clear if this is all one massive ring bus or, like the last generation of E5/E7 high-core-count chips, several rings with discrete bridges between them. Intel never went beyond 16 ring stops in a design.

    A grid topology requires far greater engineering resources to implement correctly. Physical size has to be the same for nodes in the middle of the grid, but there is a bit of wiggle room along the perimeter to expand in one of the two dimensions (handy for things like PCIe and memory controllers that have a fixed need per socket). Cache coherency has to account for variable latency between nodes on the grid: there are several paths between source and destination. The main benefit of a grid, though, is that scaling is vastly improved as core count increases. Another benefit is that not all the links between cores need to be active to move data, which saves power. Due to the ability to route around congested links, the individual links between grid nodes do not necessarily have to be as wide as those on a ring, saving a bit of energy there while maintaining similar aggregate bandwidth. For servers, multi-pathing of data (ie sending it twice) is also possible for increased RAS if an error is encountered along a particular path. Intel hasn't indicated that they're doing multipathing, but it could be a feature they add down the road. In the future, if chip stacking emerges from research labs as feasible, the grid topology can also expand into the third dimension.
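
    The scaling difference is easy to see with a toy hop-count model (nothing vendor-specific, just counting links between stops): average distance on a bidirectional ring grows linearly with node count, while on a square mesh it grows roughly with the square root.

        #include <stdio.h>
        #include <stdlib.h>

        /* Average hops between all (source, destination) pairs on a
         * bidirectional ring of n stops. */
        static double ring_avg(int n) {
            long total = 0;
            for (int s = 0; s < n; s++)
                for (int d = 0; d < n; d++) {
                    int fwd = abs(d - s);
                    total += fwd < n - fwd ? fwd : n - fwd; /* shorter direction */
                }
            return (double)total / ((double)n * n);
        }

        /* Same, for a side x side mesh using Manhattan distance. */
        static double mesh_avg(int side) {
            int n = side * side;
            long total = 0;
            for (int s = 0; s < n; s++)
                for (int d = 0; d < n; d++)
                    total += abs(s % side - d % side) + abs(s / side - d / side);
            return (double)total / ((double)n * n);
        }

        int main(void) {
            for (int side = 3; side <= 7; side++)
                printf("%2d nodes: ring %.2f hops, mesh %.2f hops\n",
                       side * side, ring_avg(side * side), mesh_avg(side));
            return 0;
        }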

    The recent Xeon Phi isn't Intel's first attempt at a grid topology. The first publicly shown-off design was their Terascale research chip a decade ago ( http://www.eetimes.com/document.asp?doc_id=1303295 ). Inter-core topology was a major driver of that research effort, and the recent Xeon Phi and Xeon series are the results. Intel isn't even the first to implement a grid topology: the Compaq/DEC Alpha EV7 did so between sockets in the early 2000s, permitting up to 64 sockets in an 8 x 8 grid. IBM used a grid-like design to move data in their BlueGene supercomputers, though that wasn't cache coherent. There could be earlier instances; those are a few that I know offhand.
  • FunBunny2 - Monday, August 21, 2017 - link

    thanks. much clearer.
  • Ryan Smith - Sunday, August 20, 2017 - link

    Thanks!
  • Lord-Bryan - Sunday, August 20, 2017 - link

    "So we have to admit that we were surprised by Qualcomm releasing so much information about the pipeline. When we’ve ever asked the mobile CPU team about Krait and Kryo, we usually hit a brick wall, left with a PR answer of a ‘custom core design’ or the guide of ‘protecting our design"
    Well am not surprised, releasing architectural details of server cpus, has always been an industry norm. It is something they just have to do if they want to be relevant, you can't just sell black boxes worth thousands of dollars to just anyone.
  • Lord-Bryan - Sunday, August 20, 2017 - link

    Plus, developers will have to know how the processor works in order to optimize applications for it. Qualcomm is playing with the big boys now; no room for unnecessary pride.
  • DanNeely - Sunday, August 20, 2017 - link

    Mobile CSS needs fixing. Bulleted lists need to wrap instead of overflowing into horizontal scroll as they currently do on Android.
  • twotwotwo - Sunday, August 20, 2017 - link

    The things I most wonder about the chip itself: how much worse will single-thread speed be, and how much better will throughput/$ be?

    For serial speed, there's sort of a sliding scale of "good enough"; you can certainly find uses for chips that are slower than Intel's large cores, but as single-thread perf gets worse and worse, more types of app become tricky to run because of latency. So you want latency stats for a typical enterprisey Java or C# (or, heck, Go) web app, or for widely used databases or infrastructure tools. That's also a test of how well compilers and runtimes are tuned for ARMv8, the chip, and life with many cores, but since early customers will have to deal with the ecosystem that exists today, that's reasonable.

    For cost-effective throughput I guess we need to have an idea of at least price and power consumption, and parallel benchmarks that will hit bottlenecks a single-threaded one might not, like memory bandwidth. And the toughest comparison is probably against Intel's parts in the same segment, Xeon D and their server Atom chips. Something that makes it harder to win big on throughput/$ is that the CPU's cost and power consumption are only a piece of the total: DRAM, storage, network, and so on account for a lot of it. Also, the big cloud customers Qualcomm wants to win probably aren't paying the same premiums to Intel as you and I do.
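
    On the "CPU is only a piece of the total" point, a back-of-envelope sketch (every price below is invented) shows why even a big CPU discount translates into a much smaller saving per node:

        #include <stdio.h>

        /* Hypothetical node bill of materials: a 40% cheaper CPU
         * moves the node total far less than 40%. */
        int main(void) {
            double cpu = 2000, dram = 3000, storage = 1500, other = 500;
            double node = cpu + dram + storage + other;
            double cheaper = cpu * 0.6 + dram + storage + other;
            printf("Baseline node:    $%.0f\n", node);
            printf("With cheaper CPU: $%.0f (only %.1f%% less overall)\n",
                   cheaper, 100.0 * (1.0 - cheaper / node));
            return 0;
        }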

    Then, aside from questions about the chip itself, there are questions about the ecosystem and customers. There are the questions above of how well toolchains and software are tuned. Maybe the biggest question is whether some big customer will make the leap and do some deployments on lots of slower cores. It might be a strategic long-term bet for some big cloud company that wants more competition in the server chip space, but I bet they have to be willing to lose real money on the effort for a generation or two first.
  • name99 - Monday, August 21, 2017 - link

    Intel sells 28-core CPUs that run at 2.1 GHz (and turbo up to 3.8 GHz, but see below).
    Hell, they sell 16-core systems that run at 2.0 GHz and only turbo up to 2.8 GHz.

    Remember QC is not TRYING to sell these to amateurs, or even as office servers. They are targeted at data warehouse tasks where the job they're doing will be pretty well defined, and it's expected for the most part that ALL the cores will be running (ie when the workload lightens, you shut down entire dies and then racks; you don't futz around with shutting down single cores).
    For environments like that, turbo'ing is of much less value. QC doesn't have to target (and isn't targeting) the entire space of HPC+server+data warehouse, just the part that's a good match for what they're offering.
  • Threska - Tuesday, August 22, 2017 - link

    Well there is the small developer virtualization market like Ansible.
  • prisonerX - Monday, August 21, 2017 - link

    A preoccupation with single-thread performance is the domain of video-game-playing teenagers and not terribly important; neither is the "latency" you refer to. This sort of high-core-count, efficient processor is going to be used where throughput and the price/power/performance ratio (ie, all three, not just any one of those) are the key metrics.

    Latency is mostly irrelevant since processing will be stream oriented and bandwidth limited rather than hamstrung by latency (thus features such as memory bandwidth compression). Gimmicks like "turbo" (which should be called by its proper name: "throttling") and favoring single thread performance are counterproductive in this mode. Being able to deploy many CPUs in dense compute nodes is what is required and memory, storage and networking are minor parts here.

    I don't know why you think compilers or runtimes are a concern; 99.9% of code is common across archs, so if you've supported a lot of x86 cores your code is going to function well on a lot of ARM cores with a small amount of arch-specific configuration. The compilers themselves, namely GCC and LLVM, are mature, as is their support for ARM.

    Finally the new ARM CPUs don't have to beat Intel, just stay roughly competitive, because the one thing the tech industry hates more than a monopoly is a monopoly that has abused its monopoly powers, and Intel is it. Industry is itching for an alternative, and near enough is good enough.
  • deltaFx2 - Thursday, August 24, 2017 - link

    "Industry is itching for an alternative": While this is true, is the industry truly interested in an alternative ISA, or alternative supplier? Because there is one now in the x86 space, and is very competitive, and in some metrics better than Intel. Also, your argument about single threaded performance being irrelevant in servers is false. A famous example of this is a paper in ISCA by google folks arguing in favor of high IPC machines (among other things). They also note that memory bandwidth is not as critical as latency. Now this is specific to google, but in plenty of other cases too, unless you have a very lopsided configuration, bandwidth doesn't get anywhere near saturation. There are also plenty of server users who provide extra cooling capacity to run at higher than base frequencies because it's cheaper than scaling out to more nodes. Obviously, your workloads should scale with freq.

    "Finally the new ARM CPUs don't have to beat Intel," -> change intel to AMD. AMD is hugely motivated to compete on price and has the performance to match intel in many workloads. And AMD's killer app is the 1P system, exactly where Qualcomm intends to go. You also have to add the cost of porting from x86->ARM (recompile, validation, etc). Time is money and employees need to be paid. So the question is, why ARM? More threads/socket? Nope. More memory/socket? Nope. More perf/thread? Probably not based on the architecture described but we'll see. More connectivity then? Nope. Lower absolute power? Maybe. Lower cost? I suspect AMD's MCM design is great for yields. And there's the porting cost if you're not already on ARM.

    There's a lot more work to be done and money to be spent before ARM becomes competitive in the mainstream server space. QC has the deep pockets to stick it out, but I am not sure about Cavium.
  • Gc - Sunday, August 20, 2017 - link

    Confusing terminology: prefetch vs. fetch

    Prefetch heuristics predict *future* memory addresses based on past memory access patterns, such as sequential or striding patterns, and try to prefetch the relevant cache lines *before* a miss occurs, attempting to avoid the cache miss or at least reduce the delay. A memory fetch to satisfy a cache miss is not a prefetch.
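
    As a concrete illustration of that heuristic (a toy model, not how Centriq's prefetchers actually work), a classic stride prefetcher looks something like this:

        #include <stdint.h>

        /* Track the last address and stride per entry; once the same stride
         * repeats enough times, predict the NEXT address before it misses. */
        typedef struct {
            uintptr_t last_addr;
            intptr_t  stride;
            int       confidence;
        } StrideEntry;

        /* Called on each demand access. Returns an address worth prefetching,
         * or 0 when confidence in the pattern is too low. */
        uintptr_t on_demand_access(StrideEntry *e, uintptr_t addr) {
            intptr_t observed = (intptr_t)(addr - e->last_addr);
            if (observed == e->stride) {
                if (e->confidence < 3) e->confidence++; /* pattern repeating */
            } else {
                e->stride = observed;                   /* retrain on a new stride */
                e->confidence = 0;
            }
            e->last_addr = addr;
            return (e->confidence >= 2) ? addr + e->stride : 0;
        }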

    Slide: "Hardware Prefetch on L1 miss."
    Text: "An L1 miss will initiate a hardware prefetch."
    The initial fetch is after the miss, so the initial fetch is not a 'prefetch'. I assume this means that it is not only fetching the missed cache line but also triggering the prefetchers to fetch additional cache lines.

    Slide: "Hardware data prefetch engine Prefetches for L1, L2, and L3 caches"
    Text: "If a miss occurs on the L1-data cache, hardware data prefetchers are used to probe the L2 and L3 caches."
    The slide is saying that data is prefetched at all three levels of cache. I'm not sure what the text is saying. Probing refers to querying the caches around the fabric to see which, if any, holds the requested cache line. This is part of any fetch, not specific to prefetching. Maybe the text is trying to say that the prefetchers not only remember past addresses and predict future addresses, but also remember which cache held past addresses and predict which cache holds fetched and prefetched addresses?
  • YoloPascual - Monday, August 21, 2017 - link

    Inb4 fanless data centers near the equator.
  • KongClaude - Tuesday, August 22, 2017 - link

    'however Samsung does not have much experience with large silicon dies'

    I don't remember the actual die size for the DEC Alphas that Samsung fabbed back in the day, but the Alpha was a fairly large CPU even by today's standards. Would they have let go of that knowledge, or is Alpha being relegated to low-volume/not-much-experience status?
  • psychobriggsy - Tuesday, August 22, 2017 - link

    We should also consider that GlobalFoundries licensed Samsung's 14nm after digging their own 14nm hole and failing to get out of it, and right now AMD is making 486mm^2 Vega dies on that process. The process doesn't have a massive maximum reticle size, however; IIRC it's around 700mm^2, whereas TSMC can do just over 800mm^2 on their 16nm.
  • dennisAustin - Thursday, August 31, 2017 - link

    Interesting... when I first read about this nearly a half-decade ago, even Intel and AMD had shorter design cycles, _and_ the x86 architecture is orders of magnitude more complex than ARM's. The target market for "Lunesta" (or is it "Centriq"?) is the data center, not Angry Birds, SPEC, or HPC. ARM's RAS specification is in its infancy at best and likely scantly implemented in Lunesta.

    It was a great concept, but the window of opportunity has long since passed. Five-ish years in and likely $1B+ invested, it's a non-starter. My advice: drop the IBM managers and re-hire managers with technical backgrounds. Remember, this product is being driven by the same guy who said "who needs 64 bits?" Unfortunate.
