85 Comments

  • coburn_c - Monday, September 8, 2014 - link

    MY God - It's full of transistors!
  • Samus - Monday, September 8, 2014 - link

    I wish there were socket 1150 Xeons in this class. If I could replace my quad-core with an octa-core...
  • wireframed - Saturday, September 20, 2014 - link

    If you can afford an 8-core CPU, I'm sure you can afford a S2011 board - it's like 15% of the price of the CPU, so the cost relative to the rest of the platform is negligible. :)
    Also, s1150 is dual-channel only. With that many cores, you'll want more bandwidth.
  • peevee - Wednesday, March 25, 2015 - link

    For many, if not most, workloads it will be faster to run 4 fast (4GHz) cores on 4 fast memory channels (DDR4-2400+) than 8 slow (2-3GHz) cores on 2 memory channels. Of course, if your workload consists of a lot of trigonometry (sine/cosine etc.), or the thread worksets fit completely into the L2 cache (only 256KB!), you may benefit from the 8-core/2-channel config. But if you have one of those, I am eager to hear what it is.
  • tech6 - Monday, September 8, 2014 - link

    The 18 core SKU is great news for those trying to increase data center density. It should allow VM hosts with 512GB+ of memory to operate efficiently even under demanding workloads. Given the new DDR4 memory bandwidth gains, I wonder if the 18 core dual socket SKUs will make quad socket servers a niche product?
  • Kevin G - Monday, September 8, 2014 - link

    In fairness, quad socket was already a niche market.

    That, and there will be quad socket versions of these chips: the E5-4600 v3s.
  • wallysb01 - Monday, September 8, 2014 - link

    My lord. My thought is that this really shows that v3 isn’t the slouch many thought it would be. An added 2 cores over v2 in the same price range and turbo boosting that appears to be functioning a little better, plus the clock-for-clock improvements and the move to DDR4, make for a nice step up when all combined.

    I’m surprised Intel went with an 18 core monster, but holy S&%T, if they can squeeze it in and make it function, why not.
  • Samus - Monday, September 8, 2014 - link

    I feel for AMD, this just shows how far ahead Intel is :\
  • Thermogenic - Monday, September 8, 2014 - link

    Intel isn't just ahead - they've already won.
  • olderkid - Monday, September 8, 2014 - link

    AMD saw Intel behind them and they wondered how Intel fell so far back. But really Intel was just lapping them.
  • MorinMoss - Friday, August 9, 2019 - link

    Hello from 2019.
    AMD has a LOT of ground to make up but it's a new world and a new race
    https://www.anandtech.com/show/14605/the-and-ryzen...
  • Kevin G - Monday, September 8, 2014 - link

    As an owner of a dual Opteron 6376 system, I shudder at how far behind that platform is. Then I look down and see that I have both of my kidneys as I didn't need to sell one for a pair of Xeons so I don't feel so bad. For the price of one E5-2660v3 I was able to pick up two Opteron 6376's.
  • wallysb01 - Monday, September 8, 2014 - link

    But the rest of the system cost is about the same. So you get 1/2 the performance for a 10% discount. YEPPY!
  • Kevin G - Monday, September 8, 2014 - link

    Nope. Build price after all the upgrades over the course of two years is somewhere around $3600 USD. The two Opterons accounted for a bit more than a third of that price. Not bad for 32 cores and 128 GB of memory. Even with Haswell-E being twice as fast, I'd have to spend nearly twice as much (the CPUs cost twice as much, as does DDR4 compared to when I bought my DDR3 memory). To put it into perspective, a single Xeon E5-2699 v3 might be faster than my build, but I was able to build an entire system for less than the price of Intel's flagship server CPU.

    I will say something odd - component prices have increased since I purchased parts. RAM prices have gone up by 50% and the motherboard I use has seemingly increased in price by $100 due to scarcity. Enthusiast video card prices have also gotten crazy over the past couple of years, so a top-of-the-line consumer video card costs $100 more.
  • wallysb01 - Tuesday, September 9, 2014 - link

    Going to the E5 2699 isn’t needed. A pair of 2660 v3s is probably going to be nearly 2x as fast the 6376, especially for floating point where your 32 cores are more like 16 cores or for jobs that can’t use very many threads. True a pair of 2660s will be twice as expensive. On a total system it would add about $1.5K. We’ll have to wait for the workstation slanted view, but for an extra $1.5K, you’d probably have a workstation that’s much better at most tasks.
  • Kevin G - Friday, September 12, 2014 - link

    Actually, if you're aiming to double the performance of a dual Opteron 6376, two E5-2695 v3's look to be a good pick for that target according to this review. A pair of those will set you back $4848, which is more than what my complete system build cost.

    Processors are only one component. So while a dual Xeon E5-2695v3 system would be twice as fast, total system cost is also approaching double due to memory and motherboard pricing differences.
  • Kahenraz - Monday, September 8, 2014 - link

    I'm running a 6376 server as well and, although I too yearn for improved single-threaded performance, I could actually afford to own this one. As delicious as these Intel processors are, they are not priced for us mere mortals.

    From a price/performance standpoint, I would still build another Opteron server unless I knew that single-threaded performance was critical.
  • JDG1980 - Tuesday, September 9, 2014 - link

    The E5-2630 v3 is cheaper than the Opteron 6376 and I would be very surprised if it didn't offer better performance.
  • Kahenraz - Tuesday, September 9, 2014 - link

    6376s can be had very cheaply on the second-hand market, especially bundled with a motherboard. Additionally, the E5-2630 v3 requires both a premium on the board and DDR4 memory.

    I'd wager you could still build an Opteron 6376 system for half or less.
  • Kevin G - Tuesday, September 9, 2014 - link

    It'd only be fair to go with the second hand market for the E5-2630v3's but being new means they don't exist. :)

    Still, going by new prices, an Opteron 6376 will be cheaper, but only by roughly 33% from what I can tell. You're correct that the new Xeons carry premium pricing on motherboards and DDR4 memory.
  • LostAlone - Saturday, September 20, 2014 - link

    Given the difference in size between the two companies it's not really all that surprising though. Intel are ten times AMD's size, and I have to imagine that Intel's chip R&D department budget alone is bigger than the whole of AMD. And that is sad really, because I'm sure most of us were learning our computer science when AMD were setting the world on fire, so it's tough to see our young loves go off the rails. But Intel have the money to spend, and can pursue so many more potential avenues for improvement than AMD and that's what makes the difference.
  • Kevin G - Monday, September 8, 2014 - link

    I'm actually surprised they released the 18 core chip for the EP line. In the Ivy Bridge generation, it was the 15 core EX die that was harvested for the 12 core models. I was expecting the same thing here with the 14 core models, though more to do with power binning than raw yields.

    I guess with the recent TSX errata, Intel is just dumping all of the existing EX dies into the EP socket. That is a good means of clearing inventory of a notably buggy chip. When Haswell-EX formally launches, it'll be of a stepping with the TSX bug resolved.
  • SanX - Monday, September 8, 2014 - link

    You have teased us with the claim that the added FMA instructions double floating point performance. Wow! Is that still possible with FP units that are already close to the limit, approaching just one clock cycle per operation? This was a good review of integer-related performance, but please team up with Ian to continue with the FP one.
  • JohanAnandtech - Monday, September 8, 2014 - link

    Ian is working on his workstation oriented review of the latest Xeon
  • Kevin G - Monday, September 8, 2014 - link

    FMA is commonplace in many RISC architectures. The reason why we're just seeing it now on x86 is that, until recently, the ISA only permitted two operands per instruction.

    Improvements in this area may be coming down the line even for legacy code. Intel's micro-op fusion has the potential to take an ordinary multiply and add and fuse them into one FMA operation internally. This type of optimization is something I'd like to see in a future architecture (Skylake?).
  • valarauca - Monday, September 8, 2014 - link

    The Intel compiler suite I believe already converts

    x *= y;
    x += z;

    into an FMA operation when confronted with them.
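
    For illustration, a minimal C sketch (the gcc/clang flags here are an assumption on my part, not something stated above): with FMA code generation and floating-point contraction enabled (e.g. -O2 -mfma -ffp-contract=fast), the separate multiply and add can be emitted as a single vfmadd instruction, and fma() from <math.h> forces the fused form explicitly.

    #include <math.h>

    /* Candidate for contraction into one fused multiply-add. */
    double mul_add(double x, double y, double z)
    {
        x *= y;
        x += z;
        return x;
    }

    /* Always fused: computes x*y + z with a single rounding step. */
    double mul_add_forced(double x, double y, double z)
    {
        return fma(x, y, z);
    }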
  • Kevin G - Monday, September 8, 2014 - link

    That's with source that is going to be compiled. (And don't get me wrong, that's what a compiler should do!)

    Micro-op fusion works on existing binaries that are years old, so no recompile is necessary. However, micro-op fusion may not work in all situations, depending on the actual instruction stream. (Hypothetically, the multiply and the add may have to be adjacent in the instruction stream for fusion to work, but an ancient compiler could have slipped some other instructions in between them to hide execution latencies as an optimization, so it would never work in that binary.)
  • DIYEyal - Monday, September 8, 2014 - link

    Very interesting read.
    And I think I found a typo: page 5 (power optimization). It is well known that THE (not needed) Haswell HAS (is/ has been) optimized for low idle power.
  • vLsL2VnDmWjoTByaVLxb - Monday, September 8, 2014 - link

    Colors or labeling for your HPC Power Consumption graph don't seem right.
  • JohanAnandtech - Monday, September 8, 2014 - link

    Fixed, thanks for pointing it out.
  • cmikeh2 - Monday, September 8, 2014 - link

    In the SKU comparison table you have the E5-2690V2 listed as a 12/24 part when it is in fact a 10/20 part. Just a tiny quibble. Overall a fantastic read.
  • KAlmquist - Monday, September 8, 2014 - link

    Also, the 2637 v2 is 4/8, not 6/12.
  • isa - Monday, September 8, 2014 - link

    Looking forward to a new supercomputer record using these behemoths.
  • Bruce Allen - Monday, September 8, 2014 - link

    Awesome article. I'd love to see Cinebench and other applications tests. We do a lot of rendering (currently with older dual Xeons) and would love to compare these new Xeons versus the new 5960X chips - software license costs per computer are so high that the 5960X setups will need much higher price/performance to be worth it. We actually use Cinema 4D in production so those scores are relevant. We use V-Ray, Mental Ray and Arnold for Maya too but in general those track with the Cinebench scores so they are a decent guide. Thank you!
  • Ian Cutress - Monday, September 8, 2014 - link

    I've got some E5 v3 Xeons in for a more workstation oriented review. Look out for that soon :)
  • fastgeek - Monday, September 8, 2014 - link

    From my notes a while back... two E5-2690 v3's (all cores + turbo enabled) under 2012 Server yielded 3,129 for multithreaded and 79 for single.

    While not Haswell, I can tell you that four E5-4657L V2's returned 4,722 / 94 respectively.

    Hope that helps somewhat. :-)
  • fastgeek - Monday, September 8, 2014 - link

    I don't see a way to edit my previous comment; but those scores were from Cinebench R15
  • wireframed - Saturday, September 20, 2014 - link

    You pay for licenses for render Nodes? Switch to 3DS, and you get 9999 nodes for free (unless they changed the licensing since I last checked). :)
  • Lone Ranger - Monday, September 8, 2014 - link

    You make mention that the large core count chips are pretty good about raising their clock rate when only a few cores are active. Under Linux, what is the best way to see actual turbo frequencies? cpuinfo doesn't show live/actual clock rate.
  • JohanAnandtech - Monday, September 8, 2014 - link

    The best way to do this is using Intel's PCM. However, this does not work right now (only on Sandy and Ivy, not Haswell). I deduced it from the fact that performance was almost identical, and from previous profiling of some of our benchmarks.
  • martinpw - Monday, September 8, 2014 - link

    There is a nice tool called i7z (can google it). You need to run it as root to get the live CPU clock display.
  • kepstin - Monday, September 8, 2014 - link

    Most Linux distributions provide a tool called "turbostat" which prints statistical summaries of real clock speeds and C-state usage on Intel CPUs.
  • kepstin - Monday, September 8, 2014 - link

    Note that if turbostat is missing or too old (doesn't support your CPU), you can build it yourself pretty quickly - grab the latest Linux kernel source, cd to tools/power/x86/turbostat, and type 'make'. It'll build the tool in the current directory.
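
    If neither i7z nor turbostat is at hand, a rough check is also possible from the cpufreq sysfs interface. Below is a minimal C sketch (an assumption on my part: it relies on a cpufreq driver such as intel_pstate or acpi-cpufreq exposing scaling_cur_freq, and it shows the driver's view of the clock rather than the MSR-accurate turbo picture that turbostat and i7z give).

    #include <stdio.h>

    int main(void)
    {
        /* Walk cpu0, cpu1, ... until a core without a cpufreq entry is hit. */
        for (int cpu = 0; ; cpu++) {
            char path[128];
            snprintf(path, sizeof(path),
                     "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_cur_freq", cpu);
            FILE *f = fopen(path, "r");
            if (!f)
                break;
            long khz = 0;
            if (fscanf(f, "%ld", &khz) == 1)
                printf("cpu%d: %.0f MHz\n", cpu, khz / 1000.0);
            fclose(f);
        }
        return 0;
    }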
  • julianb - Monday, September 8, 2014 - link

    Finally the E5-xxxx v3s have arrived. I too can't wait for the Cinebench and 3DS Max benchmark results.
    Any idea if, now that they are out, the E5-xxxx v2s will drop in price?
    Or Intel doesn't do that...
  • MrSpadge - Tuesday, September 9, 2014 - link

    Correct, Intel does not really lower prices of older CPUs. They just gradually phase them out.
  • tromp - Monday, September 8, 2014 - link

    As an additional test of the latency of the DRAM subsystem, could you please run the "make speedup" scaling benchmark of my Cuckoo Cycle proof-of-work system at https://github.com/tromp/cuckoo ?
    That will show if 72 threads (2 CPUs with 18 hyperthreaded cores each) suffice to saturate the DRAM subsystem with random accesses.

    -John
  • Hulk - Monday, September 8, 2014 - link

    I know this is not the workload these parts are designed for, but just for kicks I'd love to see some media encoding/video editing apps tested. Just to see what this thing can do with a well-coded mainstream application, or to see where the apps fade out core-wise.
  • Assimilator87 - Monday, September 8, 2014 - link

    Someone benchmark F@H bigadv on these, stat!
  • iwod - Tuesday, September 9, 2014 - link

    I am looking forward to a native 16-core die on 14nm Broadwell next year, once DDR4 has matured with much better pricing.
  • Brutalizer - Tuesday, September 9, 2014 - link

    Yawn, the new upcoming SPARC M7 CPU has 32 cores. SPARC has had 16 cores for ages. Since some generations back, the SPARC cores have been able to dedicate all resources to one thread if need be. This way the SPARC core can have one very strong thread, or massive throughput (many threads). The SPARC M7 CPU is 10 billion transistors:
    http://www.enterprisetech.com/2014/08/13/oracle-cr...
    and it will be 3-4x faster than the current SPARC M6 (12 cores, 96 threads), which holds several world records today. The largest SPARC M7 server will have 32 sockets, 1,024 cores, 64TB RAM and 8,192 threads. One SPARC M7 CPU will be as fast as an entire Sunfire 25K. :)

    The largest Xeon E5 server will probably top out at 4 sockets. I think the Xeon E7 CPUs top out at 8-socket servers. So, if you need massive RAM (more than 10TB) and massive performance, you need to venture into Unix server territory, such as SPARC or POWER. Only they have 32-socket servers capable of reaching the highest performance.

    Of course, the SGI Altix/UV2000 servers have 10,000s of cores and 100s of TB of RAM, but they are clusters, like a tiny supercomputer, only doing HPC number crunching workloads. You will never find these large Linux clusters running SAP Enterprise workloads; there are no such SAP benchmarks, because clusters suck at non-HPC workloads.

    -Clusters typically serve one user, who picks which workload to run for the next days. All SGI benchmarks are HPC; not a single Enterprise benchmark exists, for instance SAP or other Enterprise systems. They serve one user.

    -Large SMP servers with as many as 32 sockets (or even 64-sockets!!!) are typically serving thousands of users, running Enterprise business workloads, such as SAP. They serve thousands of users.
  • Assimilator87 - Tuesday, September 9, 2014 - link

    That's great and all, but there's one huge flaw with SPARC processors. They cannot run Crysis.
  • TiGr1982 - Tuesday, September 9, 2014 - link

    But in the case of the Xeon E5, are you ready to spend several grand to play Crysis?
    This kind of "joke" about "can it run Crysis" is really several years old; please stop posting it.
  • quixver - Tuesday, September 9, 2014 - link

    SAP HANA is designed to run on a cluster of Linux nodes. And I believe HANA is going to be the only supported data store for SAP. And there are a large number of domains that have moved away from big *nixes. Are there use cases for big iron still? Sure. Are there use cases for commodity Xeon/Linux boxes? Sure. Posts like yours remind me of pitches from IBM/HP/Sun where they do nothing but spread FUD.
  • Brutalizer - Wednesday, September 10, 2014 - link

    HANA is running on a scale-out cluster yes. But there are workloads that you can not run on a cluster. You need a huge single SMP server, for scaling-up instead. Scale-out (cluster) vs scale-up (one single huge server):
    https://news.ycombinator.com/item?id=8175726
    "...I'm not saying that Oracle hardware or software is the solution, but "scaling-out" is incredibly difficult in transaction processing. I worked at a mid-size tech company with what I imagine was a fairly typical workload, and we spent a ton of money on database hardware because it would have been either incredibly complicated or slow to maintain data integrity across multiple machines...."

    "....Generally it's just that [scaling-out] really difficult to do it right. Sometime's it's impossible. It's often loads more work (which can be hard to debug). Furthermore, it's frequently not even an advantage. Have a read of https://research.microsoft.com/pubs/163083/hotcbp1... Remember corporate workloads frequently have very different requirements than consumer..."

    So the SPARC M7 server with 64TB in one single huge SMP server will be much faster than a HANA cluster running Enterprise software. Besides, the HANA cluster also tops out at 64TB RAM, just like the SPARC M7 server. An SMP server will always be faster than a cluster, because in a cluster the nodes are, in the worst case, farther away from each other than in a tightly knit SMP server.

    Here is an article about clusters vs SMP servers by the die-hard IBM fan Timothy Prickett Morgan (yes, IBM supporters hate Oracle):
    http://www.enterprisetech.com/2014/02/14/math-big-...
  • shodanshok - Tuesday, September 9, 2014 - link

    The top-of-the-line Haswell-EP discussed in this article costs about $4,000. A single mid-level SPARC T4 CPU costs over $12,000, and a T5 over $20,000. So, what are we comparing?

    Moreover, while SPARC T4/T5 can dedicate all core resources to a single running thread (dynamic threading), the single S3 core powering T4/T5 is relatively simple, with 2 integer ALUs and a single, weak FPU (but with some interesting features, such as a dedicated crypto block). In other words, they cannot be compared to a Haswell core (read: they are slower).

    The M6 CPU comparison is even worse: it costs even more (but I can find no precise number, sorry - and that alone tells much about the comparison story!), and the enlarged L3 cache cannot magically solve all performance limitations. The only saving grace, at least for the M6 CPU, is that it can address very large memory, with significant implications for those workloads that really need insane amounts of RAM.

    M7 will be released next year - it is useless to throw it into the mix now.

    T4/T5 and M6/M7's real competitor is Intel's EX series, which can scale up to 256 sockets if equipped with QPI switches (while glueless systems scale to 8 sockets) and can address very large RAM. Moreover, the EX series implements the (paranoid) reliability requirements demanded in this very specific market niche.

    Regards.
  • Brutalizer - Wednesday, September 10, 2014 - link

    "...In other words, the old S3 core can not be compared to an Haswell core (read: they are slower)..."

    Well, you do know that the S3-core-powered SPARC T5 holds several world records today? For instance, the T5 is faster than Xeon CPUs at SPEC CPU2006 benchmarks:
    https://blogs.oracle.com/BestPerf/entry/20130326_s...
    The reason they benchmark against that particular Xeon E7 model is because only it scales to 8 sockets, just like the SPARC T5. The E5 does not scale to 8 sockets. So the old S3 core which "does not compare to a Haswell core" seems to be faster in many benchmarks.

    Intel EX scaling up to 256 sockets is just a cluster - basically an SGI Altix or UV2000 server. Clusters will never be able to serve thousands of users running Enterprise software. Such clusters are only fit for serving one single user, running HPC number crunching benchmarks for days at a time. No sane person will try to run Enterprise business software on a cluster. Even SGI admits this, talking about their Altix server:
    http://www.realworldtech.com/sgi-interview/6/
    "...The success of Altix systems in the high performance computing market are a very positive sign for both Linux and Itanium. Clearly, the popularity of large processor count Altix systems dispels any notions of whether Linux is a scalable OS for scientific applications. Linux is quite popular for HPC and will continue to remain so in the future,...However, scientific applications (HPC) have very different operating characteristics from commercial applications (SMP). Typically, much of the work in scientific code is done inside loops, whereas commercial applications, such as database or ERP software are far more branch intensive. This makes the memory hierarchy more important, particularly the latency to main memory. Whether Linux can scale well with a SMP workload is an open question. However, there is no doubt that with each passing month, the scalability in such environments will improve. Unfortunately, SGI has no plans to move into this SMP market, at this point in time..."

    Also, ScaleMP confirms that their huge Linux server (likewise 10,000s of cores, 100s of TB of RAM) is a cluster, capable only of running HPC number crunching workloads:
    http://www.theregister.co.uk/2011/09/20/scalemp_su...
    "...Since its founding in 2003, ScaleMP has tried a different approach. Instead of using special ASICs and interconnection protocols to lash together multiple server modes together into a SMP shared memory system, ScaleMP cooked up a special software hypervisor layer, called vSMP, that rides atop the x64 processors, memory controllers, and I/O controllers in multiple server nodes....vSMP takes multiple physical servers and – using InfiniBand as a backplane interconnect – makes them look like a giant virtual SMP server with a shared memory space. vSMP has its limits....The vSMP hypervisor that glues systems together is not for every workload, but on workloads where there is a lot of message passing between server nodes – financial modeling, supercomputing, data analytics, and similar parallel workloads. Shai Fultheim, the company's founder and chief executive officer, says ScaleMP has over 300 customers now. "We focused on HPC as the low-hanging fruit..."

    You will never find Enterprise benchmarks, such as SAP benchmarks, for the SGI server or other 256-socket Xeon servers such as ScaleMP's. Because they are clusters. Read the links if you don't believe me.
  • shodanshok - Saturday, September 13, 2014 - link

    Hi,
    the 256-socket E7 system is not a cluster: it uses QPI switches to connect the various sockets and it runs a single kernel image. After all, clusters scale to 1000s of CPUs, while single-image E7 systems are limited to 256 sockets max.

    Regarding the SPEC benchmarks, please note that Oracle posts SPEC2006 _RATE_ (multi-threaded throughput) results only. Due to the very nature of the Niagara (T1-T5) processors, the RATE score is very competitive: they are barrel processors, and obviously they excel in total aggregate throughput. The funny thing is that on a throughput-per-socket basis, Ivy Bridge is already a competitive or even superior choice. Let's use the SPEC numbers (but remember that compiler optimization plays a critical role here):

    Oracle Corporation SPARC T5-8 (8 sockets, 128 cores, 1024 threads)
    SPECint® 2006: not published
    SPECint®_rate 2006: 3750
    Per socket perf: 468.75
    Per core perf: 29.27
    Per thread perf: ~3.66
    LINK: http://www.spec.org/cpu2006/results/res2013q2/cpu2...

    PowerEdge R920 (Intel Xeon E7-4890 v2, 4 sockets, 60 cores, 120 threads)
    SPECint® 2006: 59.7
    SPECint®_rate 2006: 2390
    Per socket perf: 597.5
    Per core perf: 39.83
    Per thread perf: 19.92
    LINK: http://www.spec.org/cpu2006/results/res2014q1/cpu2...
    LINK: http://www.spec.org/cpu2006/results/res2014q1/cpu2...

    As you can see, by any metric the S3 core is not a match for an Ivy Bridge core, let alone Haswell. Moreover, T1/T2/T3 did not have the dynamic threading feature, leading to always-active barrel processing and extremely slow single-thread performance (I saw relatively simple queries that took 18 _minutes_ on a T2 - and no, the disks weren't busy, they were totally idle).

    Sun's latency-oriented processor (the one really intended for big Unix boxes) was the Rock processor, but it was cancelled without a clear explanation. LINK: http://en.wikipedia.org/wiki/Rock_(processor)

    The funny thing about this whole post is that performance alone doesn't matter: _reliability_ was the real plus of big-iron Unixes, followed by memory capacity. It is for these specific reasons that Intel is not so verbose about Xeon E7 performance, while it really stresses the added reliability and serviceability (RAS) features of the new platform, coupled with the SMI links that greatly increase memory capacity.

    In the end, there is surely space for big, RISC, proprietary Unix boxes; however, this space is shrinking rapidly.

    Regards.
  • Brutalizer - Monday, September 15, 2014 - link

    "...the 256 socket E7 system is not a cluster: it uses QPI switches to connect the various sockets and it run a single kernel image..."

    Well, they are a cluster, running a single Linux kernel image, yes. Have you ever seen benchmarks that are NOT clustered? Nope. You only see HPC benchmarks. For instance, no SAP benchmarks there.

    Here is an example of such a Linux server we are talking about: the ScaleMP server, which has 10,000s of cores and 100s of TB of RAM and runs Linux. The server scales to 256 sockets, using a variant of QPI.
    http://www.theregister.co.uk/2011/09/20/scalemp_su...
    "...Since its founding in 2003, ScaleMP has tried a different approach [than building complex hardware]. Instead of using special ASICs and interconnection protocols to lash together multiple server modes together into a shared memory system, ScaleMP cooked up a special hypervisor layer, called vSMP, that rides atop the x64 processors, memory controllers, and I/O controllers in multiple server nodes. Rather than carve up a single system image into multiple virtual machines, vSMP takes multiple physical servers and – using InfiniBand as a backplane interconnect – makes them look like a giant virtual SMP server with a shared memory space. vSMP has its limits.

    The vSMP hypervisor that glues systems together is not for every workload, but on workloads where there is a lot of message passing between server nodes – financial modeling, supercomputing, data analytics, and similar HPC parallel workloads. Shai Fultheim, the company's founder and chief executive officer, says ScaleMP has over 300 customers now. "We focused on HPC as the low-hanging fruit"

    From 2013:
    http://www.theregister.co.uk/2013/10/02/cray_turns...
    "...Big SMP systems are very expensive to design and manufacture. Building your own proprietary crossbar interconnect and associated big SMP tech, is a very expensive proposition. But SMP systems provide performance advantages that are hard to match with distributed systems....The beauty of using ScaleMP on a cluster is that large shared memory, single o/s systems can be easily built and then reconfigured at will. When you need a hella large SMP box, you can one with as many as 128 nodes and 256 TB of memory..."

    I have another link where SGI says their large Linux Altix box is only for HPC cluster workloads. They don't do SMP workloads. Enterprise software branches too much, whereas HPC workloads mostly run tight for-loops - suitable for running on each node.

    Regarding the SPECint2006 benchmarks you showed: the SPARC T5 is a server processor. That means throughput - serving as many thousands of clients as possible in a given time. It does not really matter if a client gets the answer in 0.2 or 0.6 seconds, as long as the server can serve 15,000 clients instead of the 800 clients a desktop CPU can. Desktop CPUs focus on a few strong threads and have low throughput. My point is that the old SPARC T5 does very well even on single-threaded benchmarks, but is made for server loads (i.e. throughput), where it crushes.

    Let us see the benchmarks of the announced SPARC M7 and compare them to x86 or IBM POWER8 or anyone else. It will annihilate them. It has several TB/sec of bandwidth; it is truly a server CPU.
  • shodanshok - Monday, September 15, 2014 - link

    Hi,
    I remember Fujitsu clearly stating that the EX platform makes it possible to use 256 sockets in a _single image_ server, without using any vSMP tech. Moreover, from my understanding vSMP uses InfiniBand as the back-end, not QPI (or HT): http://www.scalemp.com/technology/versatile-smp-vs...

    Anyway, as I don't personally manage any of these systems, I could be wrong. However, every Internet reference I can find seems to support my understanding: http://www.infoworld.com/d/computer-hardware/infow... Can you provide a doc link about the EX's maximum socket limit?

    Regarding your CPU evaluation, sorry sir, but you are _very_ wrong. Interactive systems (web servers, OLTP, SAP, etc.) are very latency sensitive. If a user had to wait 18 minutes (!) for a billing order to show, they would be very upset.

    Server CPUs are surely more throughput-oriented than their desktop cousins. However, latency remains a significant factor in the server space. After all, this is the very reason Oracle enhanced its T series with dynamic threading (having cancelled Rock).

    The best server CPU (note: I am _not_ saying that Xeons are the best CPUs) will have high throughput within a reasonable latency threshold (the 99th percentile is a good target), all with reasonable power consumption (so as not to throw efficiency out of the window).

    Regards.
  • Brutalizer - Tuesday, September 16, 2014 - link

    You talk about 256-socket x86 servers such as the SGI Altix or UV2000 servers not being a cluster. Fine for you. Have you ever wondered why the big mature Enterprise Unix and Mainframe servers are stuck at 32 sockets after decades of R&D? They have tried for decades to increase performance; for instance, Fujitsu has a 64-socket SPARC server right now, called the M10-4S. Why are there no 128-socket mature Unix or Mainframe servers? And suddenly, out of nowhere, comes a small player, and their first Linux server has 128 or 256 sockets. Does this sound reasonable to you? A small player succeeds where IBM, HP and Oracle/Sun never did, despite those companies pouring in decades, billions, and vast armies of engineers and researchers? Why are there no 64-socket Unix servers other than Fujitsu's? Why are there no 256-socket Unix servers? Can you answer me this?

    And another question: why are all benchmarks from 256-socket Altix and ScaleMP servers only cluster HPC benchmarks? No SMP Enterprise benchmarks? Why does everyone use these servers for HPC cluster workloads?

    SGI talks about their large Linux Altix server here, which has 10,000s of cores (is it 128 or 256 sockets?):
    http://www.realworldtech.com/sgi-interview/6/
    "...The success of Altix systems in the high performance computing market are a very positive sign for both Linux and Itanium. Clearly, the popularity of large processor count Altix systems dispels any notions of whether Linux is a scalable OS for scientific applications. Linux is quite popular for HPC and will continue to remain so in the future,...However, scientific applications (HPC) have very different operating characteristics from commercial applications (SMP). Typically, much of the work in scientific code is done inside loops, whereas commercial applications, such as database or ERP software are far more branch intensive. This makes the memory hierarchy more important, particularly the latency to main memory. Whether Linux can scale well with a SMP workload is an open question. However, there is no doubt that with each passing month, the scalability in such environments will improve. Unfortunately, SGI has no plans to move into this SMP market, at this point in time..."

    Here, SGI says explicitly that their Altix and its successor the UV2000 are only for HPC cluster workloads - and not for SMP workloads. Just as ScaleMP says regarding their large Linux server with 10,000s of cores.

    Both of them say these servers are only suitable for cluster workloads, and there are only cluster benchmarks out there. Ergo, in reality, these large Linux servers behave like clusters. If they can only run cluster workloads, then they are clusters.

    I invite you to post links to any large Linux server with as many as 32 sockets. Hint: you will not find any. The largest Linux server ever sold has 8 sockets. It is just a plain 8-socket x86 server, sold by IBM, HP or Oracle. Linux has never scaled beyond 8 sockets, because larger servers do not exist. Here I am talking about SMP servers, not clusters. If you go to Linux clusters, you can find 10,000 cores and larger. But when you need to do SMP Enterprise business workloads, the largest Linux server is 8 sockets. Again: I invite you to post links to a Linux server larger than 8 sockets which is not a cluster, and not a Unix RISC server that someone tried to compile Linux onto. Examples of the latter are the IBM AIX P795 Unix server, or HP's Itanium HP-UX "Big Tux" Unix server. They are Unix servers.

    So, the largest Linux server has 8 sockets. Any Linux server larger than that is a cluster - because, as the manufacturers say, they can only run cluster workloads. Linux scales terribly on an SMP server; 8 sockets seems to be the limit.
  • shodanshok - Tuesday, September 16, 2014 - link

    Hi,
    Please note that the RWT article you are endlessly posting is 10 (TEN!) years old.

    SGI tells the exact contrary of what you report:
    https://www.sgi.com/pdfs/4227.pdf

    Altix UV systems are shared memory systems connecting the various boards (4-8 sockets per board) via QPI and NUMAlink. They are basically a distributed version of your beloved scale-up server. After all, the maximum memory limit is 16 TB, which is the address space _a single Xeon_ can address.

    I am NOT saying that commodity x86 hardware can replace proprietary big boxes in every environment. What I am saying is that the market niche for big Unix boxes is rapidly shrinking.

    So, to recap:
    1) in an article about the Xeon E5 ($4,000 max) you talk about the mighty M7 (which is NOT available), which will probably cost 10-20X more (and even T4/T5 are 3-5X);

    2) you speak about SPECint2006 while conveniently skipping anything other than throughput, totally ignoring latency and per-thread performance (and even in pure throughput the Xeons are very competitive, at a fraction of the cost);

    3) you totally ignore the fact that QPI and NUMAlink enable multi-board systems to act as a single one, running a single kernel image within a shared memory environment.

    Don't get me wrong: I am not an Intel fan, but I must say I'm impressed with the Xeons Intel has been releasing for the last 4 years (since Nehalem-EX). Even their (small) Itanium niche is at risk, attacked by higher end E7 systems.

    Maybe (and hopefully) Power8 and M7 will be earth-shattering, but they will surely cost much, much more...

    Regards.
  • Brutalizer - Friday, September 19, 2014 - link

    This is folly. The link I posted, where SGI says their "Altix" server is only for HPC clustered workloads, applies also today to the "Altix" successor: the "Altix UV". Fact is that no large Unix or Mainframe vendor has successfully scaled beyond 32/64 sockets. And now SGI, a small cluster vendor, claims to have a 256-socket server, with tiny resources compared to the large Unix companies?? Has SGI succeeded where no one else has, despite them pouring in decades and billions of R&D?

    As a response you post a link where SGI talks about their "Altix UV", and you claim that link as evidence that the Altix UV server is not a cluster. Well, if you bothered to read your link, you would see that SGI has not changed their viewpoint: it is only for HPC clustered workloads. For instance, the "Altix UV" material talks about MPI. MPI is only used in clusters, mainly for number crunching. I have worked with MPI in scientific computations, so I know this. No one would use MPI in an SMP server such as the Oracle M7. Anyone talking about MPI is also talking about clusters. For instance, enterprise software such as SAP does not use MPI.

    As a coup de grace, I quote text from your link about the latest "Altix UV" server:
    "...The key enabling feature of SGI Altix UV is the NUMAlink 5 interconnect, with additional performance characteristics contributed by the on-node hub and its MPI Offload Engine (MOE)...MOE is designed to take MPI process communications off the microprocessors, thereby reducing CPU overhead and lowering memory access latency, and thus improving MPI application performance and scalability. MOE allows MPI tasks to be handled in the hub, freeing the processor for computation. This highly desirable concept is being pursued by various switch providers in the HPC cluster arena;
    ...
    But fundamentally, HPC is about what the user can achieve, and it is this holy quest that SGI has always strived to enable with its architectures..."

    Maybe this is the reason you will not find SAP benchmarks on the largest "Altix UV" server? Because it is a cluster.

    But of course, you are free to disprove me by posting SAP benchmarks from a large Linux server with 10,000s of cores (i.e. a cluster). I agree that if that SGI cluster runs SAP faster than a 32-socket SMP server, then it does not matter whether it is a cluster or not. The point is: clusters cannot run all workloads; they suck at Enterprise workloads. If they can run Enterprise workloads, then I will change my mind. Because, in the end, it does not matter how the hardware is constructed, as long as it can run SAP fast enough. But clusters cannot.

    Post SAP benchmarks on a large Linux server. Go ahead. Prove me wrong when you say they are not clusters - in that case they would be able to handle non clustered workloads such as SAP. :)
  • shodanshok - Friday, September 19, 2014 - link

    Brutalizer, I am NOT (NOT!!!) saying that x86 is the best in the world at scale-up performance. After all, it remains commodity hardware, and some choices clearly reflect that. For example, while Intel puts single-image systems at as many as 256 sockets, the latency induced by the switches/interconnect surely puts the practical number way lower.

    What I am saying is that the market that truly needs big Unix boxes is rapidly shrinking, so your comments about how "mediocre" this new 18-core monster is are totally out of place.

    Please note that:
    1) Altix UV systems are SHARED MEMORY systems built out of clusters, where the "secret sauce" is the added tech behind NUMAlink. Will SAP run well on these systems? I think not: the NUMAlinks add too much latency. However, this same tech can be used in a number of cases where big Unix boxes were the first choice (at least in SGI's words; I don't have such a system (unfortunately!), so I can't tell more);

    2) HP has just released SAP HANA benchmarks for a 16-socket Intel E7 in a scale-up configuration (read: a single system) with 12/16 TB of RAM
    LINK1 :http://h30507.www3.hp.com/t5/Converged-Infrastruct...
    LINK2: http://h30507.www3.hp.com/t5/Reality-Check-Server-...
    LINK3: http://h20195.www2.hp.com/V2/GetPDF.aspx%2F4AA5-14...

    3) Even at 8 sockets, the Intel systems are very competitive. Please read here for some benchmarks: http://www.anandtech.com/show/7757/quad-ivy-brigde...
    Long story short: an 8S Intel E7-8890 (15 cores @ 2.8 GHz) beat an 8S Oracle T5-8 (16 cores @ 3.6 GHz) by a significant margin. Now think about 18 Haswell cores...

    4) On top of that, even high-end E7 Intel x86 systems are way cheaper than Oracle/IBM boxes, while providing similar performance. The real differentiation is the extreme RAS features integrated into proprietary Unix boxes (e.g. lockstep) that require custom, complex glue logic on x86. And yes, some Unix boxes have impressive amounts of memory ;)

    5) This article speaks about *Haswell-EP*. These parts are one (sometimes even two...) orders of magnitude cheaper than proprietary Unix boxes. So why on earth, in every Xeon article, do you complain about how mediocre that technology is?

    Regards.
  • Brutalizer - Monday, September 22, 2014 - link

    I hear you when you say that x86 does not have the best scale-up performance. I am only saying that those 256-socket x86 servers you talk of are, in practice, nothing more than clusters, because they are only used for clustered HPC workloads. They will never run Enterprise business software like a large SMP server with 32/64 sockets - that domain is exclusive to Unix/Mainframe servers.

    It seems that we disagree on the 256-socket x86 servers, but agree on everything else (x86 is cheaper than RISC, etc). I claim they can only be used as clusters (you will only find HPC cluster benchmarks). So, those large Linux servers with 10,000 cores, such as the SGI Altix UV, are actually only usable as clusters.

    Regarding the HP SAP HANA benchmarks with the 16-socket x86 server called ConvergedSystem 9000: it is actually a Unix Superdome server (a RISC server) where HP swapped all the Itanium CPUs for x86 CPUs. Well, it is good that 16-socket Linux servers will soon be available on the market. But HANA is a clustered database. I would like to see the HP ConvergedSystem server running non-clustered Enterprise workloads - how well would the first 16-socket Linux server perform? We have to see. And then we can compare the fresh 16-socket Linux server to the mature 32/64-socket Unix/Mainframe servers in benchmarks and see which is fastest. A clustered 256-socket Linux server sucks on SMP benchmarks; it would be useless.
  • Brutalizer - Monday, September 22, 2014 - link

    http://www.enterprisetech.com/2014/06/02/hps-first...
    "...The first of several systems that will bring technologies from Hewlett-Packard’s Superdome Itanium-based machines to big memory ProLiant servers based on Xeon processors is making its debut this week at SAP’s annual customer shindig.

    Code-named “Project Kraken,” the system is commercialized as the ConvergedSystem 900 for SAP HANA and as such has been tuned and certified to run the latest HANA in-memory database and runtime environment. The machine, part of a series of high-end shared memory systems collected known as “DragonHawk,” is part of a broader effort by HP to create Superdome-class machines out of Intel’s Xeon processors.
    ...

    The obvious question, with SAP allowing for HANA nodes to be clustered, is: Why bother with a big NUMA setup instead of a cluster? “If you look at HANA, it is really targeting three different workloads,” explains Miller. “You need low latency for transactions, and in fact, you can’t get that over a cluster...."
  • TiGr1982 - Tuesday, September 9, 2014 - link

    Our RISC scale-up evangelist is back!

    That's OK and very nice, nobody argues, but I guess one has to win a serious jackpot to afford one of these 32 socket Oracle SPARC M7-based machines :)

    Jokes aside, technically you are correct, but the Xeon E5 is obviously not about the very best scale-up on the planet, because Intel is aiming more at the mainstream server market. So, the Xeon E5 line resides in a totally different price range than your beastly 32-socket scale-up machines, so what's the point of writing about the SPARC M7 here?
  • TiGr1982 - Tuesday, September 9, 2014 - link

    Talking Intel, even the Xeon E7 is a much lower-class line in terms of total firepower (CPU and RAM capability) than your beloved 32-socket SPARC Mx, and even the Xeon E7 is much cheaper than your Mx-32, so, again, what's the point of posting this in an article about the E5?
  • Brutalizer - Wednesday, September 10, 2014 - link

    The point is, people believe that building a huge SMP server with as many as 32 sockets is easy: just add a few Xeon E5s and you are good to go. That is wrong. It is exponentially more difficult to build an SMP server than a cluster. So, no one has ever sold such a huge Linux server with 32 sockets. (The IBM P795 is a Unix server that people tried to compile Linux for, but it is not a Linux server, it is a RISC AIX server.)
  • TiGr1982 - Wednesday, September 10, 2014 - link

    Well, I comprehend and understand your message, and I agree with you. Huge SMP scale-up servers are really hard to build, mostly because of the dramatically increasing complexity of implementing a REALLY fast (in terms of both bandwidth and latency) interconnect between sockets when the socket count grows considerably (say, up to 32), which is required in order to get a true SMP machine.

    I hope, other people get your message too.

    BTW, I remember you already posted this kind of statement in the Xeon E7 v2 article comments before :-)
  • Brutalizer - Monday, September 15, 2014 - link

    "...I hope, other people get your message too...."

    Unfortunately, they don't. See shodanshok's reply above, claiming that the 256-socket Xeon servers are not clusters, and my reply explaining why they are.
  • bsd228 - Friday, September 12, 2014 - link

    Now go price memory for M-class Sun servers... even small upgrades are 5 figures, and going back 4 years, a mid-sized M4000-type server was going to cost you around 100k with a moderate amount of memory.

    And it would take up a large portion of the rack, whereas you can stick two of these 18-core guys in a 1U server and have 10 of them (180 cores) for around the same sort of money.

    Big iron still has its place, but the economics will always be lousy.
  • platinumjsi - Tuesday, September 9, 2014 - link

    ASRock are selling boards with DDR3 support, any idea how that works?

    http://www.asrockrack.com/general/productdetail.as...
  • TiGr1982 - Tuesday, September 9, 2014 - link

    Well... ASRock is generally famous for "marrying" different generations of hardware.
    But here, since this is about DDR RAM, which is governed by the CPU itself (because the memory controller is inside the CPU), my only guess is that the Xeon E5 v3 may have a dual-mode memory controller (supporting either DDR4 or DDR3), similar to the Phenom II back in 2009-2011, which supported either DDR2 or DDR3 depending on where you plugged it in.

    If so, then the performance of the E5 v3 with DDR3 will probably just be somewhat inferior compared to DDR4.
  • alpha754293 - Tuesday, September 9, 2014 - link

    No LS-DYNA runs? And yes, for HPC applications, you actually CAN have too many cores (because you can't keep the working cores pegged with work/something to do, so you end up with a lot of data migration between cores, which is bad, since moving data means that you're not doing any useful work ON the data).

    And how you decompose the domain (for both LS-DYNA and CFD) makes a HUGE difference in total runtime performance.
  • JohanAnandtech - Tuesday, September 9, 2014 - link

    No, I hope to get that one done in the more Windows/ESXi oriented review.
  • Klimax - Tuesday, September 9, 2014 - link

    Nice review. Next stop: Windows Server. (And MS-SQL..)
  • JohanAnandtech - Tuesday, September 9, 2014 - link

    Agreed. PCIe flash and SQL Server look like a nice combination to test these new Xeons.
  • TiGr1982 - Tuesday, September 9, 2014 - link

    Xeon 5500 series (Nehalem-EP): up to 4 cores (45 nm)
    Xeon 5600 series (Westmere-EP): up to 6 cores (32 nm)
    Xeon E5 v1 (Sandy Bridge-EP): up to 8 cores (32 nm)
    Xeon E5 v2 (Ivy Bridge-EP): up to 12 cores (22 nm)
    Xeon E5 v3 (Haswell-EP): up to 18 cores (22 nm)

    So, in this progression, core count increases by 50% (1.5 times) almost each generation.

    So, what's gonna be next:

    Xeon E5 v4 (Broadwell-EP): up to 27 cores (14 nm) ?

    Maybe four rows with 5 cores and one row with 7 cores (4 x 5 + 7 = 27) ?
  • wallysb01 - Wednesday, September 10, 2014 - link

    My money is on 24 cores.
  • SuperVeloce - Tuesday, September 9, 2014 - link

    What's the story with the 2637 v3? Only 4 cores and the same frequency and $1k price as the 6-core 2637 v2? By far the most pointless CPU on the list.
  • SuperVeloce - Tuesday, September 9, 2014 - link

    Oh, nevermind... I unknowingly caught an error.
  • JohanAnandtech - Tuesday, September 9, 2014 - link

    thx! Fixed. Sorry for the late reaction - jetlagged and trying to adjust to the hectic pace of IDF :-)
  • hescominsoon - Tuesday, September 9, 2014 - link

    As long as AMD continues its idiotic design of two integer units sharing an FPU, they will be an afterthought in the CPU department.
  • nils_ - Sunday, September 14, 2014 - link

    Serious competition for Intel will not come from AMD any time soon, but possibly from IBM with the POWER8. Tyan even came out with a single-socket board for that CPU, so it might make its way into the same market soon.
  • ScarletEagle - Tuesday, September 16, 2014 - link

    Any feel for the relative HPC performance of the E5-2680v3 with respect to the E5-2650Lv3? I am looking at purchasing a PowerEdge 730 with two of these and the 2133MHz RAM. My guess is that the higher base clock speed should make somewhat of an improvement?
