HStewart - Friday, October 26, 2018 - link
My guess is this is a joint venture between Microsoft and Intel. Intel called the new Xeons "Scalable" for a reason, and I believe this system has a new chipset that allows it to scale from 8 to 16 to 32 CPUs, and maybe more.
Intel has had 8-CPU systems for over a decade. My biggest question since the arrival of multi-core CPUs is: what is the difference in performance between a box with eight single-core CPUs and a single 8-core CPU?
Of course there are physical limitations to building, say, a Zen-architecture system with 896 cores out of 112 8-core Zen chips.
I think Intel and Microsoft have found a way to make the interconnect scalable, and this raises an interesting question about the value of having more cores per CPU - especially if the system was designed to be hot-pluggable, so that one failed CPU does not bring the entire system down.
name99 - Friday, October 26, 2018 - link
What EXACTLY is promised here? A 32-element cluster of 28-core elements, or even a 4-element cluster of 8x28-core elements, is no big deal; a cluster OS is doubtless useful for MS, but no great breakthrough. So is the claim that this is a SINGLE COHERENT address space?
As cekim says below, why? What does coherency buy you here? And if it's not coherent, then what's new? Is the point that they managed to get massive-scale coherency without a dramatic cost in latency and extra hardware (directories and suchlike)?
HStewart - Friday, October 26, 2018 - link
What, then, is the difference between a single 32-core CPU and a dual 16-core system? Under a single OS we used to have to run multiple CPUs, but now we have multi-core systems, which is great for laptops. When we're talking about servers, why not have 32 CPUs in the box as well? In that world the core count is basically cores per CPU times the number of sockets.
mode_13h - Friday, October 26, 2018 - link
You're conveniently abstracting away all of the practical details that would answer your question.
peevee - Monday, October 29, 2018 - link
"So is the claim that this is a SINGLE COHERENT address space?"Almost certainly.
"As cekim says below, why? What does coherency buy you here?"
Nothing particularly good. A temptation to use SMP or even 2-node-optimized NUMA software, only to discover that it does not scale beyond 16-32 threads.
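To make peevee's point concrete, here is a small sketch (my own illustration, not anything from the article) of why SMP-optimized code stops scaling on a box like this: the same counting loop run once against a single shared atomic counter, and once against one counter per thread. On a large multi-socket machine the first version spends most of its time bouncing one cache line between cores and sockets.

```c
/* Hedged sketch, not from the article: SMP-style shared counter vs.
 * per-thread counters. Build: cc -O2 -pthread counters.c */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define OPS_PER_THREAD 10000000L

static atomic_long shared_counter;                 /* SMP-style: one hot cache line */

struct percpu { _Alignas(64) atomic_long count; }; /* one cache line per thread     */
static struct percpu *locals;

static void *bump_shared(void *arg) {
    (void)arg;
    for (long i = 0; i < OPS_PER_THREAD; i++)
        atomic_fetch_add(&shared_counter, 1);      /* contended read-modify-write   */
    return NULL;
}

static void *bump_local(void *arg) {
    atomic_long *mine = arg;
    for (long i = 0; i < OPS_PER_THREAD; i++)
        atomic_fetch_add(mine, 1);                 /* stays in this core's cache    */
    return NULL;
}

static double run(void *(*fn)(void *), int nthreads, int use_local) {
    pthread_t *t = calloc(nthreads, sizeof(*t));
    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (int i = 0; i < nthreads; i++)
        pthread_create(&t[i], NULL, fn, use_local ? (void *)&locals[i].count : NULL);
    for (int i = 0; i < nthreads; i++)
        pthread_join(t[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &b);
    free(t);
    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int main(int argc, char **argv) {
    int n = (argc > 1) ? atoi(argv[1]) : 8;
    locals = aligned_alloc(64, (size_t)n * sizeof(struct percpu));
    for (int i = 0; i < n; i++)
        atomic_init(&locals[i].count, 0);

    printf("one shared counter:  %8.1f ms with %d threads\n", run(bump_shared, n, 0), n);
    printf("per-thread counters: %8.1f ms with %d threads\n", run(bump_local, n, 1), n);
    return 0;
}
```

Run it with increasing thread counts (./a.out 4, ./a.out 32, ...): the shared-counter time should keep growing, while the per-thread version stays roughly flat as long as there are idle cores.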
cekim - Friday, October 26, 2018 - link
Welcome to 20 years ago with SSI, SKMD, MOSIX, etc... This avenue of architecture has a trail of bodies on it so far. No compelling improvement over other variations of task migration, load sharing, MPI, etc. has thus far been demonstrated that justifies the complexity in 99.99999999999% of use cases. Windows has such a solid track record on scheduling and stability so far, I'd be sure to sign up for more pain the first chance I got... /sarcasm
HStewart - Friday, October 26, 2018 - link
One thing that is different is that a single one of the 896 cores in this beast has more power than a mainframe computer from 20 years ago, and now you have 896 of them in less than 1/10 the size.
cekim - Friday, October 26, 2018 - link
That's no different from single system image... just higher density. Companies like Cray, SGI, Fujitsu, etc. have been taking any processor they could get their hands on, connecting them up with high-speed, low-latency fabrics, and providing a single-system-image view of such machines at the operator level. Cray and SGI used Alphas, MIPS, Opterons, and Xeons with anywhere from 1 to N cores for this. Once Beowulf showed up, things migrated toward networks of independent nodes - commodity, or leading-edge but still off-the-shelf - with a central director node.
This is a pretty well-worn path at this point. The question is whether MSFT can provide a compelling licensing option and make it actually perform.
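For contrast with the single-system-image approach, here is roughly what the explicit message-passing model cekim mentions looks like in practice - a minimal MPI sketch of my own (not from the article), where every rank does its own node-local work and data only moves when you explicitly ask for it:

```c
/* Hedged sketch: explicit message passing instead of one coherent image.
 * Build and run with an MPI toolchain, e.g.: mpicc mpi_sum.c && mpirun -np 8 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process/node am I?      */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many are cooperating?     */

    long local = (long)rank * 1000;         /* stand-in for node-local work  */
    long total = 0;

    /* Explicit data movement: every rank's partial result is combined on rank 0. */
    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of partial results from %d ranks: %ld\n", size, total);

    MPI_Finalize();
    return 0;
}
```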
HStewart - Friday, October 26, 2018 - link
One possible difference could be memory usage, depending on how the bus is designed.
mode_13h - Friday, October 26, 2018 - link
Um, no. Memory usage must be local to each blade, or else it will perform like garbage. Power efficiency would be much worse, as well.
alpha754293 - Saturday, October 27, 2018 - link
I think that you've accidentally hit the nail on the head though:
All of those systems that you've mentioned above (Cray, SGI, Fujitsu, etc.) with Alphas, MIPS, SPARC, etc. - note that none of those are x86 or x86_64 ISA class systems.
a) Those were all proprietary systems with proprietary derivatives of BSD or SVR4 UNIX, and they would "appear" to be single-image systems to the user, but in reality, you KNOW that they weren't, because the technology wasn't actually truly there. For example, the head node would deploy the slave nodes and the slave processes on said slave nodes, but transferring those slave processes across or between slave nodes was generally NOT a trivial task: once a slave process had been assigned to a slave node, migrating it elsewhere was generally anything but trivial.
Depending on what you're running, that can be either a good thing or a bad thing. If you have an HPC run and one of the slave nodes is failing, or its hardware has already failed, the death of the slave process is reported back to the head node as the entire run having failed. The lack of portability in the slave processes means that you don't inherently have any level of HA available to you within the single image.
This is what, I presume, would be one of the key differences with the Windows-based single image.
If you want to lock a slave process down by way of a processor affinity mask, you can do so across the sockets/nodes. But if you want to have HA, say, "within a box", you can do that as well (see the sketch at the end of this comment).
And again, I think that you accidentally hit the nail on the head because unlike the Cray/SGI/Fujitsu/POWER systems, the single image isn't initialized via the headnode. In this case, I don't think that there IS a headnode in the classical sense per se anymore, and THAT is what makes this vastly more interesting (perhaps) than the stuff that you mentioned.
(Also, BTW, on the systems that you've mentioned, the head node does NOT typically perform any of the actual computationally intensive work. It's not designed to; it's only designed to keep everything else coordinated. Thus, if you only have two nodes, for example, trying to get the head node to also perform the computations is NOT a common or recommended practice. You'll see that in the various sysadmin guides from the respective vendors.)
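On the processor-affinity point above: as a reference, this is roughly what pinning a thread looks like on a Windows image with more than 64 logical processors, where affinity is expressed per processor *group* of up to 64. A hedged sketch of my own - the group number and mask below are arbitrary choices for illustration, not anything documented for this particular system.

```c
/* Hedged sketch: pin the calling thread to two logical processors of group 0. */
#define _WIN32_WINNT 0x0601
#include <windows.h>
#include <stdio.h>

int main(void)
{
    printf("Processor groups: %u\n", (unsigned)GetActiveProcessorGroupCount());
    printf("Logical processors, all groups: %lu\n",
           (unsigned long)GetActiveProcessorCount(ALL_PROCESSOR_GROUPS));

    GROUP_AFFINITY want = {0}, had;
    want.Group = 0;      /* arbitrary: first processor group                 */
    want.Mask  = 0x3;    /* arbitrary: logical processors 0 and 1 of group 0 */

    if (!SetThreadGroupAffinity(GetCurrentThread(), &want, &had)) {
        fprintf(stderr, "SetThreadGroupAffinity failed: %lu\n",
                (unsigned long)GetLastError());
        return 1;
    }
    printf("Now pinned to group %u, mask 0x%llx (previously group %u, mask 0x%llx)\n",
           (unsigned)want.Group, (unsigned long long)want.Mask,
           (unsigned)had.Group, (unsigned long long)had.Mask);
    return 0;
}
```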
theeldest - Friday, October 26, 2018 - link
HPE Superdome Flex supports 32 Intel Xeon Platinum sockets: https://psnow.ext.hpe.com/doc/PSN1010323140USEN.pd...
HStewart - Friday, October 26, 2018 - link
Interesting - check out the link within: it states it supports Windows Server 2016 Datacenter.
https://h20195.www2.hpe.com/v2/GetDocument.aspx?do...
My question is whether this is the same unit Microsoft is using - it is modular and supports 4 to 32 CPUs in 4-CPU increments, with each 4-CPU module pluggable for 99% uptime.
Alex_Haddock - Friday, November 2, 2018 - link
Azure is typically reticent to name vendor kit, and whilst I don't work in the Superdome Flex group directly I'd say there is a high chance... The system is a very nice amalgamation of the SGI UV acquisition and the mission-critical aspects of the Superdome range. Roadmap is very cool (any channel partners reading this attending TSS Europe in March 19 or the US equivalent, I'd highly recommend attending some of the Superdome sessions). Will see if there is any chance of getting Ian remote access to one for an AnandTech test as well... I've enjoyed CPU articles on this site myself since its inception :-).
arnd - Friday, October 26, 2018 - link
Huawei have another one with similar specifications: https://e.huawei.com/kr/products/cloud-computing-d...
HStewart - Friday, October 26, 2018 - link
This one is slightly different - it only supports 1-96 cores per partition, for a total of 80 partitions. So it is basically like 80 connected machines, not treated as a single machine.
loony - Friday, October 26, 2018 - link
Microsoft already uses the SD Flex for their Linux-based Azure SAP HANA offerings, so I assume they just finally got Windows working on the SGI box.
cyberguyz - Friday, October 26, 2018 - link
Y'know, PC is short for "Personal Computer". I think I can say with a straight face that if a system has 32 sockets, it ain't no stinkin' PC!
HollyDOL - Friday, October 26, 2018 - link
Some persons are more demanding than others :-)
HStewart - Friday, October 26, 2018 - link
Well, you could say the same thing about a 16- or 32-core computer. But this is not actually intended as a PC - it's a server - though one could make a killer workstation out of it.
But who knows, we may be talking about 896 cores in a laptop in 10 years.
mode_13h - Saturday, October 27, 2018 - link
Yes and no. Yes, you're already talking about 896-core laptops *today*. But no, nobody will be *seriously* talking about such things.
The main reason being that core scaling is non-linear in both power and area. Also, Moore's law is dead.
peevee - Monday, October 29, 2018 - link
"The main reason being that core scaling is non-linear in both power and area."If the basic architecture is Von Neuman-based.
mode_13h - Monday, October 29, 2018 - link
We're only extrapolating, here. If you're proposing some fundamentally new technology, like room-temperature quantum computers, then of course that would be a game-changer.
But such predictions need to be justified - not blue-sky wishful thinking - and generally aren't relevant to the conversation.
hotaru - Friday, October 26, 2018 - link
more likely it's just a regular quad-socket system running a VM like this: https://www.dragonflydigest.com/2018/02/26/20940.html
hotaru - Friday, October 26, 2018 - link
more likely it's just a regular quad-socket system running a VM like this: https://www.dragonflydigest.com/2018/02/26/20940.h...
deil - Friday, October 26, 2018 - link
In the pic it's in the middle of updating. And of course there's no way to update without interrupting what you're doing, plus of course the configuration resets on each update, right?
rocky12345 - Friday, October 26, 2018 - link
I liked seeing all of those threads sitting at 99%-100%. 896 cores / 1792 threads is powerful no matter how you look at it or how it is set up.
The big question is "But can it play..." never mind... :) /jk
mode_13h - Friday, October 26, 2018 - link
Earlier this year, I would've said realtime ray tracing should be a good workload for it. But now, I doubt it would outpace a handful of RTX 2080 Ti's.
CheapSushi - Friday, October 26, 2018 - link
This was always my dream for a homelab. When searching for information it always gets muddied by the discussion about having many systems work on one problem. But what I wanted is many systems that look like one system that can handle many problems (such as what you get with 2P, 4P and 8P as shown). The single-system-image OSes and software are pretty dated now, though, and much of it is still complex to implement, especially for a hobbyist. I don't have a use case, but I just always wanted something along those lines. In my homelab with 6 systems, VMs, etc., it still feels disjointed.
kb9fcc - Friday, October 26, 2018 - link
And with Microsoft's per-core licensing it's going to take at minimum 448 licenses (2-packs) for the low, low price of $2,757,440 just for Windows (2019 Datacenter Edition). Want to actually do something with all those cores, like run MS SQL Server? Get ready to pony up another $6,386,688. Ouch.
rahvin - Friday, October 26, 2018 - link
You might be surprised to hear this, but the majority of companies running machines like this aren't running Windows.
Shocking, I know. This feels a lot like a "look what we can do" system rather than something anyone but a select few would really want.
kb9fcc - Friday, October 26, 2018 - link
<sarcasm> No, really? </sarcasm> However, besides this CPU behemoth, Microsoft is the central focus of this article. Can it be done? Sure. But fueled by M$? Meh.
ytoledano - Saturday, October 27, 2018 - link
Windows Server 2019 Datacenter costs $6,155 for one 16-core license. 896 cores will cost $344,680.
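For what it's worth, a quick arithmetic check of the two totals quoted in this sub-thread, using only the prices cited here (not verified against a current Microsoft price list; the per-2-core SQL Server figure is simply inferred from the $6,386,688 total above):

```c
/* Sanity-check of the licensing totals quoted above, using the prices as
 * cited in this thread (not verified against a current price sheet). */
#include <stdio.h>

int main(void)
{
    int cores = 896;

    /* Windows Server 2019 Datacenter: per-core licensing sold as 16-core
     * packs, $6,155 per pack as quoted above. */
    int windows_packs = cores / 16;                     /* 56 packs    */
    double windows_total = windows_packs * 6155.0;      /* $344,680    */
    printf("Windows Datacenter:    %3d x $6,155  = $%.0f\n", windows_packs, windows_total);

    /* SQL Server Enterprise: per-core, sold as 2-core packs; $14,256 per
     * pack reproduces the $6,386,688 figure quoted earlier. */
    int sql_packs = cores / 2;                          /* 448 packs   */
    double sql_total = sql_packs * 14256.0;             /* $6,386,688  */
    printf("SQL Server Enterprise: %3d x $14,256 = $%.0f\n", sql_packs, sql_total);

    return 0;
}
```

That also suggests where the earlier $2,757,440 Windows figure comes from: 448 two-core packs priced at the $6,155 16-core rate.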
vFunct - Saturday, October 27, 2018 - link
At this point you might as well hire 10 Postgres core developers to update Postgres for you.
Dug - Friday, November 2, 2018 - link
I miss the 2012 pricing structure.
MattZN - Friday, October 26, 2018 - link
It's an accomplishment to be able to connect so many CPUs together and still have a working system. Scaling kernels becomes more difficult at those core counts, because even non-contended shared lock latencies start to go through the roof due to cache ping-ponging. (Intel does have their transactional memory instructions to help with this, but it's a really problematic technology and only works for the simplest bits of code.) Plus, page table operations for kernel memory itself often wind up having to spam all the cores to synchronize the PTE changes. We've managed to reduce the IPI spamming by two orders of magnitude over the last decade, but having to do it at all creates a real problem on many-core systems (see the sketch after this comment).
I agree that it just isn't all that cost-effective these days to have more than 2 sockets. The motherboards and form factors are so complex and customized to such a high degree that there just isn't any cost savings to squeeze out beyond 4 sockets. And as we have seen, even 4 sockets has fallen into disfavor as computational power has increased.
I think the future is going to be more of an AMD-style chiplet topology, where each CPU 'socket' in the system is actually running a complex of multiple CPU chips inside it, rather than requiring a physical socket for each one. There is no need for motherboards to expand beyond 2 sockets, honestly. Memory and compute density will continue to improve, bandwidth will continue to improve... 4+ socket monsters are just a waste of money when technology is moving this quickly. They are one-off custom jobs that become obsolete literally in less than a year.
-Matt
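To put a rough number on the PTE-synchronization point in MattZN's first paragraph above, here is a Linux-specific sketch (my own illustration, not from the article) that times munmap() of a page which a configurable number of other threads have just touched. The kernel has to invalidate any TLB entries those CPUs may hold, which it does with IPIs, so the unmap cost grows with the number of touching threads.

```c
/* Hedged sketch: measure munmap() cost vs. number of threads that touched
 * the page. Build: cc -O2 -pthread shootdown.c && ./a.out 16 */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <time.h>

#define ITERS 2000

static volatile char *page;                 /* current shared mapping          */
static pthread_barrier_t mapped, touched;   /* main <-> touchers handshakes    */

static void *toucher(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        pthread_barrier_wait(&mapped);      /* wait for a fresh mapping        */
        page[0] = 1;                        /* pull translation into this TLB  */
        pthread_barrier_wait(&touched);     /* tell main it's safe to unmap    */
    }
    return NULL;
}

int main(int argc, char **argv) {
    int n = (argc > 1) ? atoi(argv[1]) : 4;
    pthread_barrier_init(&mapped, NULL, n + 1);
    pthread_barrier_init(&touched, NULL, n + 1);

    pthread_t *t = calloc(n, sizeof(*t));
    for (int i = 0; i < n; i++)
        pthread_create(&t[i], NULL, toucher, NULL);

    double total_us = 0;
    for (int i = 0; i < ITERS; i++) {
        page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if ((void *)page == MAP_FAILED) { perror("mmap"); return 1; }
        pthread_barrier_wait(&mapped);
        pthread_barrier_wait(&touched);     /* every thread has touched it now */

        struct timespec a, b;
        clock_gettime(CLOCK_MONOTONIC, &a);
        munmap((void *)page, 4096);         /* triggers TLB-shootdown IPIs     */
        clock_gettime(CLOCK_MONOTONIC, &b);
        total_us += (b.tv_sec - a.tv_sec) * 1e6 + (b.tv_nsec - a.tv_nsec) / 1e3;
    }
    for (int i = 0; i < n; i++)
        pthread_join(t[i], NULL);

    printf("avg munmap() cost with %d touching threads: %.2f us\n", n, total_us / ITERS);
    free(t);
    return 0;
}
```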
HStewart - Friday, October 26, 2018 - link
But with AMD's design, it would probably take a wafer the size of a 32-inch TV to make that happen.
Intel has a better option for custom jobs with EMIB - it can even carry an AMD-designed GPU and HBM2 memory.
MrSpadge - Friday, October 26, 2018 - link
Whatever they do to the scheduler for this beast should also help EPYC, with its additional layer of NUMA compared to Intel.
mode_13h - Friday, October 26, 2018 - link
What if extending their dual-EPYC scheduler is actually what enabled *this*?
Granted, going over OmniPath probably incurs much more latency than talking to another EPYC CPU.
mode_13h - Friday, October 26, 2018 - link
I'd love to hear a good use case for this.
As pointed out in the article, with the memory latency of accessing data on another blade, it just doesn't make sense to treat this as a single system.
Alex_Haddock - Friday, November 2, 2018 - link
There are some massive in-memory situations where this is absolutely of benefit. The blog below has a pretty good overview of the ASIC interconnects.
https://community.hpe.com/t5/Servers-The-Right-Com...
Interestingly there are plenty of cases where ultimate latency is less important than not having to move data from flash to RAM - in this case the ability to use Xeon Gold CPUs to support a ton of RAM and CPUs but without the extra cost is a real benefit, despite losing links.
In the long run, datacenters are becoming more and more constrained by the power cost of moving data; this is what is leading to the memory-based computing that systems like the Superdome Flex are the precursor to.
mode_13h - Monday, November 5, 2018 - link
Thanks. I'll check it out.
Akasbergen - Friday, October 26, 2018 - link
You may want to read up on HPE Superdome Flex.
sor - Friday, October 26, 2018 - link
I feel like we need to go back and look at that CPU0 article again and update the memes :)
watersb - Saturday, October 27, 2018 - link
I clicked through the Amazon link following this article, to the quad-socket-rated Xeon Gold 6134. I can add Amazon Expert Installation for only $83! And a 4-year Extended Product Warranty for only $18!
I need to get out of our current support contract...
CharonPDX - Saturday, October 27, 2018 - link
IBM has made "cluster" systems out of these high-end Xeons for a while - using direct PCI Express links between the individual server chassis via InfiniBand plugs.
I thought it was impressive playing with one in 2010 that was two separate chassis linked together, each with eight eight-core "Nehalem-EX" Xeons. Seeing 256 threads and 2 TB of RAM in Task Manager was insane! Of course, at the time, those CPUs were $55,000, and the RAM was $100,000.
Machinus - Saturday, October 27, 2018 - link
what's a defence?
Ian Cutress - Saturday, October 27, 2018 - link
The correct way of spelling the word ;)
Machinus - Saturday, October 27, 2018 - link
For when you remove a fence.
The Von Matrices - Saturday, October 27, 2018 - link
Only 2 TB of RAM for 32 sockets? With 6 memory channels per socket, that's an average of 10.6 GB per DIMM - basically the minimum amount of memory you can put in one of these servers. I guess that makes sense, since any workload that would need lots of memory would be hindered by the bandwidth/latency of the inter-socket interconnect anyway.
speculatrix - Sunday, October 28, 2018 - link
I think another factor, licensing conditions, also reduces the benefit of paying huge premiums for servers with massive core counts.
If you're buying VMware licenses you currently pay per socket, and there's a sweet spot in combined server price with high-core-count CPUs to get the best overall cost per core.
With Oracle, I think licensing is by core, not socket or server, so there's no benefit in stuffing a machine with more sockets and higher core counts to optimise the Oracle licensing cost.
So it's worth optimising the overall price of the entire server and licenses together.
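To make the "sweet spot" reasoning concrete, here is a toy calculation with entirely made-up CPU and per-socket licence prices (none of these figures come from the article or any real price list). The point is just that a per-socket licence is amortised over more cores as the core count per CPU rises, until the CPU price itself starts to dominate:

```c
/* Hedged sketch with hypothetical prices: cost per core for a 2-socket server
 * under per-socket hypervisor licensing, across CPU options of different core
 * counts. All numbers are invented for illustration only. */
#include <stdio.h>

int main(void)
{
    double license_per_socket = 7000.0;   /* hypothetical per-socket licence     */
    int sockets = 2;

    /* hypothetical CPU options: cores per socket and price per CPU */
    int    cores_per_cpu[] = { 8, 16, 28, 32 };
    double cpu_price[]     = { 800.0, 2500.0, 9000.0, 13000.0 };

    for (int i = 0; i < 4; i++) {
        int    total_cores = sockets * cores_per_cpu[i];
        double total_cost  = sockets * (cpu_price[i] + license_per_socket);
        printf("%2d-core CPUs: %3d cores, $%8.0f total, $%6.0f per core\n",
               cores_per_cpu[i], total_cores, total_cost, total_cost / total_cores);
    }
    return 0;
}
```

With these made-up numbers the cost per core bottoms out at the mid-to-high core counts rather than at the biggest CPU, which is the "sweet spot" effect described above.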
SarahKerrigan - Sunday, October 28, 2018 - link
As others have noted, the Superdome Flex supports Windows and is 32s. There have also been previous 32s and 64s IPF machines running Windows, although obviously the fact that this is x86 is a distinction from those. :-)
peevee - Monday, October 29, 2018 - link
"When I spoke to a large server OEM last year, they said quad socket and eight socket systems are becoming rarer and rarer as each CPU by itself has more cores the need for systems that big just doesn't exist anymore"The real reason is rise of distributed computing (Hadoop, Spark and alike), which can achieve similar performance using much cheaper hardware (2-cpu servers). Those monster multi-socket systems allow you to write applications like for SMP, but in practice if you write for NUMA like for SMP it does not scale, you need to write for NUMA like it is a distributed thing anyway.
spikespiegal - Wednesday, October 31, 2018 - link
"The real reason is rise of distributed computing (Hadoop, Spark and alike), which can achieve similar performance using much cheaper hardware (2-cpu servers)."??
The decline in >2-socket servers is because the companies still running on-prem data services don't see the value, and the far bigger customer base, which is cloud providers, would rather scale horizontally than vertically. Aside from XenApp / RDS or the rare database instance, provisioning more than 4 CPUs to a single guest VM is the exception when you look at all virtual servers in aggregate. Any time you start slinging more virtual CPUs at a single guest, the odds are increasingly in favor of a cluster of OSes hosting the application pool doing it more efficiently than a single OS. So, in all likelihood this is a product designed to better facilitate spinning up large commercial virtual server environments, not to solve a problem for a singular application.
DigiWise - Sunday, November 4, 2018 - link
I have a stupid question: a single 8180 CPU has 38.5 MB of L3 cache, but I see from the screenshot only 16 MB for this massive core count. How is that possible?
alpha754293 - Friday, January 11, 2019 - link
Dr. Cutress:
I don't have the specifics, but there are ways to present multiple hardware nodes to an OS as a single image.
There is a cluster-specific Linux distro called "Rocks Cluster" that can find all of the nodes on the cluster, and if you give it work to do, it will automatically "move" processes from the head node over to the slave nodes and manage them entirely.
I would not be surprised if it is managing the system as a single-image OS if this is going through Intel's Omni-Path interconnect so that they have some form of RDMA (whether it'd be RoCE or not).
If they're quad-socket blades, and you can fit 14 blades in 7U, then this will fit within that physical space (actually with room to spare - only 8 of the 14 quad-socket blades would be populated). This can be powered by up to eight 2200 W redundant power supplies (using Supermicro's offerings as a surrogate source of data), which would take up to 17.6 kW total at peak/max power consumption.
The other possibility would be dual-socket nodes, with four half-width nodes per 2U chassis. That would require four such chassis to get to 32 sockets, taking 8U of total physical space. If it uses four chassis like this, and each one uses 1600 W redundant power supplies, that would put the total power consumption at 6.4 kW, so it's entirely possible.
Unfortunately, this is just my speculation based on some of the available hardware (using Supermicro's offerings as a reference).