Original Link: https://www.anandtech.com/show/9193/the-xeon-e78800-v3-review



The story behind the high-end Xeon E7 has been an uninterrupted triumphal march for the past 5 years: Intel's most expensive Xeon beats Oracle servers (which cost an order of magnitude more) silly, and offers much better performance per watt/dollar than the massive IBM POWER servers. Each time a new generation of quad/octal socket Xeons is born, Intel increases the core count, RAS features, and performance per core while charging more for the top SKUs. Each time that price increase is justified, as the total cost of a similar RISC server is a multiple of that of a Xeon E7 server. From the Intel side, this new generation based upon the Haswell core is no different: more cores (18 vs 15), better RAS, slightly more performance per core and ... higher prices.

However, before you close this tab of your browser, know that even this high-end market is getting (more) exciting. Yes, Intel is correct that the market momentum is still very much in its favor, and thus in favor of x86.

No less than 98% of the server shipments have been "Intel inside". No less than 92-94% of the four socket and higher servers contain Intel Xeons.  From the revenue side, the RISC based systems are still good for slightly less than 20% of the $49 Billion (per year) server market*.  Oracle still commands about 4% (+/- $2 Billion), but has been in a steady decline. IBM's POWER based servers are good for about 12-15% (including mainframes) or $6-7 Billion depending on who you ask (*).  

It is however not game over (yet?) for IBM. The big news of the past months is that IBM has sold its x86 server division to Lenovo. As a result, Big Blue can finally throw its enormous weight behind the homegrown POWER chips. Instead of a confusing and half-hearted "we will sell you x86 and Itanium too" message, we now get the "time to switch over to OpenPOWER" message. IBM spent $1 billion to encourage ISVs to port x86 Linux applications to the Power Linux platform. IBM also opened up its hardware: since late 2013, the OpenPOWER Foundation has been growing quickly, with Wistron (ODM), Tyan and Google building hardware on top of the POWER chips. The OpenPOWER Foundation now has 113 members, and lots of OpenPOWER servers are being designed and built. Timothy Green of The Motley Fool believes OpenPOWER will threaten Intel's server hegemony in the largest server market, China.

But enough of that. This is AnandTech, and here we quantify claims instead of just rambling about changing markets. What has Intel cooked up and how does it stack up to the competition? Let's find out.

(*) Source: IDC Worldwide Quarterly Server Tracker, 2014Q1, May 2014, Vendor Revenue Share



The New Xeon E7v3: The Same Xeon Haswell-EP Die...

The "new" Xeon E7 is basically the same Haswell EP die that we talked about in our Xeon E5 reviews. Spot the only difference between the Haswell EP,

and the Haswell EX die: 

Indeed, a third QPI link which allows an 8-socket configuration without any special glue logic. The transistor count is the same (5.69 Billion) and that is also true for the massive die size (662 mm²).



Xeon E7 v3 System and Memory Architecture

So, the Xeon E5 "Haswell EP" and Xeon E7 "Haswell EX" are the same chip, but the latter has more features enabled and as a result it finds a home in a different system architecture.

Debuting alongside the Xeon E7 v3 is the new "Jordan Creek 2" buffer chip, which offers support for DDR4 LR-DIMMs or buffered RDIMMs. However, if necessary it is still possible to use the original "Jordan Creek" buffer chips with DDR3, giving the Xeon E7 v3 the ability to be used with either DDR3 or DDR4. Meanwhile, just like their predecessors, the Jordan Creek 2 buffers can run either in lockstep (1:1) or in performance mode (2:1). If you want more details, read our review of the Xeon E7 v2 or Intel's own comparison.

To sum it up, in lockstep mode (1:1): 

  1. The Scalable Memory Buffer (SMB) is working at the same speed as the RAM, max. 1866 MT/s. 
  2. Offers higher availability as the memory subsystem can recover from two sequential RAM failures
  3. Has lower bandwidth as the SMB is running at max. 1866 MT/s 
  4. ...but also lower energy for the same reason (about 7W instead of 9W). 

In performance mode (2:1): 

  1. You get higher bandwidth as the SMB is running at 3200 MT/s (Xeon E7 v2: 2667 MT/s), twice the speed of the memory channels. The SMB combines two memory channels of DDR4-1600.
  2. Higher energy consumption as the SMB is running at full speed (9W TDP, 2.5 W idle)
  3. The memory subsystem can recover from one device/chip failure as the data can be reconstructed in the spare chip thanks to the CRC chip. 

This is a firmware option, so you choose once whether being able to lose 2 DRAM chips is worth the bandwidth hit.
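To make the 1:1 versus 2:1 relationship concrete, here is a minimal Python sketch. It assumes a standard 64-bit (8-byte) DDR data channel, which is our assumption and not something stated above. The four buffers per socket and the two DDR channels per buffer come from the platform description; the function names are purely illustrative.

```python
# Assumption (not from the article): a DDR channel moves 8 bytes per transfer.
DDR_BYTES_PER_TRANSFER = 8
CHANNELS_PER_BUFFER = 2   # each SMB combines two DDR channels
BUFFERS_PER_SOCKET = 4    # four memory buffers per socket


def smb_link_rate(ddr_mts, mode):
    """Return the buffer link rate in MT/s for a given DDR speed and mode."""
    if mode == "lockstep":       # 1:1, link runs at DRAM speed
        return ddr_mts
    if mode == "performance":    # 2:1, link runs at twice DRAM speed
        return 2 * ddr_mts
    raise ValueError(f"unknown mode: {mode}")


def peak_dram_bandwidth_gbs(ddr_mts):
    """Theoretical peak DRAM bandwidth per socket in GB/s (all channels busy)."""
    channels = BUFFERS_PER_SOCKET * CHANNELS_PER_BUFFER
    return channels * ddr_mts * DDR_BYTES_PER_TRANSFER / 1000


print(smb_link_rate(1600, "performance"), "MT/s buffer link")
print(peak_dram_bandwidth_gbs(1600), "GB/s peak per socket")
```

Under these assumptions, DDR4-1600 in performance mode corresponds to a 3200 MT/s buffer link and roughly 102 GB/s of theoretical DRAM bandwidth per socket.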

Xeon E7 vs E5

The different platform/system architecture is the way that the Xeon E7 differentiates itself from the Xeon E5, even though both chips have what is essentially the same die. Besides being able to use 4 and 8 socket configurations, the E7 supports much more memory. Each socket connects via Scalable Memory Interconnect 2 (SMI2) to four "Jordan Creek 2" memory buffers.

Jordan Creek 2 memory buffers under the black heatsinks with 6 DIMM slots

Each of these memory buffers supports 6 DIMM slots. Multiply four sockets by four memory buffers and six DIMM slots and you get a total of 96 DIMM slots. With 64 GB LR-DIMMs (see our tests of Samsung/IDT based LRDIMMs here) in those 96 DIMM slots, you get an ultra expensive server with no less than 6 TB RAM. That is why these systems are natural hosts for in-memory databases such as SAP HANA and Microsoft's Hekaton.
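The capacity arithmetic can be checked with a few lines of Python. This is only a sketch of the multiplication described above; the variable names are ours.

```python
SOCKETS = 4
BUFFERS_PER_SOCKET = 4
DIMMS_PER_BUFFER = 6
DIMM_SIZE_GB = 64  # 64 GB LR-DIMMs

dimm_slots = SOCKETS * BUFFERS_PER_SOCKET * DIMMS_PER_BUFFER
total_gb = dimm_slots * DIMM_SIZE_GB
total_tb = total_gb / 1024
per_socket_tb = total_tb / SOCKETS

print(f"{dimm_slots} DIMM slots, {total_gb} GB = {total_tb} TB total")
print(f"{per_socket_tb} TB per socket")
```

The same numbers give 1.5 TB per socket, which is the per-socket maximum Intel quotes for this platform.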

There is more of course. With 48 trillion memory cells, chances are uncomfortably high that one of them will go bad, so you want some excellent reliability features to counter that. Memory mirroring is nothing new, but the Xeon E7 v3 allows you to mirror only the critical part of your memory instead of simply dividing capacity by 2. Also new is "multiple rank sparing", which provides dynamic failover of up to four ranks of memory per memory channel. In other words, not only can the system shrug off a single chip failure, but even a complete DIMM failure won't be enough to take the system down either.



Haswell Architecture Improvements

We have discussed the advantages that the Haswell core brings here in more detail. In a nutshell:

  • The core can sustain about 10% more integer instructions per clock cycle than its predecessor, Ivy Bridge. 
  • Virtualized applications should perform slightly better thanks to the lower VM exit/entry latency.
  • HPC applications could/should benefit much more if they are recompiled to make use of the improved AVX2 and Fused Multiply Add (FMA) support.
  • Database transactional applications should benefit more thanks to the lower synchronization latency.
  • In-memory databases should benefit if they are adapted to make use of the AVX2 256-bit integer vector operations.

Again, the same is true about the Xeon E5-2600v3. So what makes the E7 special? 

Transactional Synchronization Extensions: I'll be back 

There is one "new" - or rather "now working" - feature: TSX, the famous Transactional Synchronization eXtensions. These extensions are all about making locking more "optimistic" (you let the CPU handle the bookkeeping to maintain consistency). TSX is quite powerful, but can also be a liability in the wrong use case. Developers will need a deep understanding of locking and parallel programming to be able to make good use of TSX, because:

  1. ... you still have to rewrite your code (inserting hints)
  2. TSX may reduce performance in some situations: if indeed a pessimistic lock was necessary, the transaction has to be re-executed with a "traditional" conservative way of locking. You could call it a "lock misprediction".  
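The "lock misprediction" trade-off can be illustrated with a toy cost model in Python. This is our own simplification and not Intel's or SAP's methodology: it assumes an aborted optimistic attempt wastes the full optimistic execution time and is then re-executed once under a conventional lock. The function names and the example costs are hypothetical.

```python
def expected_cost(p_abort, t_elided, t_locked):
    """Average cost of one critical section when the optimistic path is tried first.

    A successful attempt costs t_elided. An aborted attempt wastes t_elided
    and then pays t_locked for the conventional, locked re-execution.
    """
    return (1 - p_abort) * t_elided + p_abort * (t_elided + t_locked)


def elision_pays_off(p_abort, t_elided, t_locked):
    """True if trying the optimistic path first beats always taking the lock."""
    return expected_cost(p_abort, t_elided, t_locked) < t_locked


# Hypothetical costs in arbitrary time units: optimistic path 1, locked path 4.
for p_abort in (0.1, 0.5, 0.9):
    cost = expected_cost(p_abort, 1.0, 4.0)
    print(f"abort rate {p_abort:.0%}: average cost {cost:.2f}, "
          f"worth it: {elision_pays_off(p_abort, 1.0, 4.0)}")
```

In this model the optimistic path only wins while the abort rate stays below 1 - t_elided / t_locked, which is why locks that genuinely conflict most of the time are poor candidates.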

Introducing TSX in software requires assessing the different locks in the application, using different libraries and quite a bit of tuning. SAP and Intel did this for the expensive in-memory data mining SAP HANA software.

 

The upgrade from "Ivy Bridge EX" to "Haswell-EX" yielded 50% more performance, while introducing TSX roughly doubled performance. So in TSX enabled data mining software, Haswell-EX has the potential to reduce the waiting time by a factor of 3 or more.
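The factor of 3 follows from multiplying the two gains, assuming they stack multiplicatively (which is how we read the slide). A short Python check:

```python
core_upgrade_gain = 1.5   # Ivy Bridge EX -> Haswell-EX, +50%
tsx_gain = 2.0            # TSX roughly doubles performance on top of that

combined_speedup = core_upgrade_gain * tsx_gain
relative_wait = 1.0 / combined_speedup  # fraction of the original waiting time

print(f"combined speedup: {combined_speedup:.1f}x")
print(f"waiting time shrinks to {relative_wait:.0%} of the original")
```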



Xeon E7 v3 SKUs and prices

Intel's SKU list has always been complex and very long. For reference, this is what the Xeon E7 v2 line-up looked like when it launched in 2014:

There is a scalable 2S line that is not scalable beyond 2 sockets, a frequency optimized 8857 which is probably faster in many applications than the 8893 and so on.

Luckily, with the introduction of the Xeon E7 v3, Intel simplified the SKU list. 

First of all, the hardly scalable 2 socket version is gone. And at the low-end, you now get an 8-core processor at 2 GHz instead of a 6-core at 1.9 GHz. Well done, Intel.

The high-end models are all capable of running in 8 socket configurations. But the enterprise people looking for a high-end quad socket system have to pay a bit more: about 8 to 10%.  Most enterprise people will not care, but getting 20% more cores (slightly improved) at 8-10% lower clocks while paying about 8% more is not exactly a vast improvement. Of course these are paper specs, but Intel used to be (a lot) more generous. 
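A back-of-the-envelope Python sketch shows why these paper specs look thin. It assumes throughput scales linearly with core count times clock speed and ignores the per-core improvements of Haswell, so it is a pessimistic simplification of ours, not a measurement.

```python
def paper_throughput_ratio(core_ratio, clock_ratio, ipc_ratio=1.0):
    """Naive throughput estimate: cores x clock x (optional) per-clock gain."""
    return core_ratio * clock_ratio * ipc_ratio


CORE_RATIO = 1.20    # 20% more cores
PRICE_RATIO = 1.08   # about 8% higher price

for clock_ratio in (0.90, 0.92):   # 8 to 10% lower clocks
    perf = paper_throughput_ratio(CORE_RATIO, clock_ratio)
    print(f"clock x{clock_ratio:.2f}: throughput x{perf:.3f}, "
          f"throughput per dollar x{perf / PRICE_RATIO:.3f}")
```

On paper that works out to roughly 8 to 10% more throughput for about 8% more money, so performance per dollar barely moves unless the core improvements contribute.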

Intel's own slides confirm this. The gains in SPECint2006_rate are rather small to justify the price increase. Intel claims higher OLTP (TPC-C) increases, but the mentioned gains are rather optimistic. For example, the HammerDB benchmark runs 29% faster on the E7-8890 v3 than on the E7-4890v2. This benchmark is much more transparent, straightforward and easier to reproduce than TPC-C, so we feel it is probably closer to the real world. Secondly, in both cases (HammerDB and TPC-C), the E7-8890v3 had twice as much memory (1 TB vs 512 GB) as its predecessor.

Lastly, these are benchmarks after all. In the real world systems are not running at full speed, so the gains are much smaller.  So it seems that in most applications besides the TSX/AVX2 enabled ones, the gains will be rather small. 



The Competitor: IBM's POWER8

As we briefly mentioned in the introduction, among all of the potential competitors for the Xeon E7 line, IBM's OpenPOWER might be the most potent competitor at this time. So how do IBM's offerings compare to Intel's? IBM's POWER8 is a Brainiac (high IPC) design that also wants to be a speed demon (high clock speeds).

The POWER8 core can decode, issue, execute and retire 8 instructions per cycle. That degree of instruction level parallelism (ILP) cannot be extracted out of (most) software. To battle the lack of ILP in software, no less than 8 threads (SMT) are active per core. According to IBM:

  • 2 threads deliver about 45% more performance than one
  • 4 threads deliver yet another 30% boost
  • the last 4 threads deliver about 7%

So in total, the 8-way SMT doubles the performance of this massive core. Let us compare the two chips. 
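Compounding IBM's three SMT steps confirms the "doubles the performance" claim. A minimal Python sketch, assuming each boost applies on top of the previous one:

```python
# Relative gain of each SMT step, keyed by the resulting thread count per core.
smt_steps = {2: 1.45, 4: 1.30, 8: 1.07}

total = 1.0
for threads, gain in smt_steps.items():
    total *= gain
    print(f"SMT{threads}: {total:.2f}x the single-thread throughput")
```

The product of the three steps is just over 2, so eight threads yield about twice the throughput of one thread on the same core.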

Xeon E7v3/POWER8 Comparison

| Feature | Intel Haswell-EX Xeon E7 | IBM POWER8 |
| --- | --- | --- |
| Process tech. | 22nm FinFET | 22nm SOI |
| Max clock | 2.5-3.6 GHz | 3.5-4.35 GHz |
| Max. core count / Max. thread count | 18 cores @ 2.5 GHz, 36 SMT threads | 12 cores @ … GHz, 96 SMT threads |
| Max. sustained IPC | 6 (4) | 8 |
| L1-I / L1-D Cache | 32 KB/32 KB | 32 KB/64 KB |
| L2 Cache | 256 KB SRAM per core | 512 KB SRAM per core |
| L3 Cache | 2.5 MB SRAM per core | 8 MB eDRAM per core |
| L4 Cache | None | 16 MB eDRAM per MBC (64/128 MB total) |
| Memory | 1.5 TB per socket (64 GB per DIMM) | 1-2 TB per socket (64 GB per DIMM) |
| Theoretical Memory Bandwidth | 102 GB/s (independent mode) | 204 GB/s |
| PCIe 3.0 Lanes | 40 Lanes | 32 Lanes |

The POWER8 looks better than Haswell-EX in almost every spec, but the devil is of course in the details. First of all, Intel's L2-cache works at the same clock as the core, while IBM's L2-cache runs at a lower clock (2.2 GHz or less, depending on the model). Secondly, the POWER8's L3 eDRAM cache might be much larger, but it is also a bit slower.

But the main disadvantage of the POWER8 is that all this superscalar wideness and high clockspeed goodness comes with a power price. This slide from Tyan at the latest OpenPOWER conference tells us more.

A 12 core POWER8 is "limited" to 3.1 GHz if you want to stay below the 190W TDP mark. Clockspeeds higher than 4 GHz are only possible with 8-cores and a 250W TDP. This makes us really curious what kind of power dissipation we may expect from the 4.2 GHz 10-core POWER8 inside the expensive E870 Enterprise systems (300W?).  

That is not all. Each "Jordan Creek 2" memory buffer on the Intel system is limited to about 9W. IBM uses a similar but more complex "Centaur" memory buffer (including a 16 MB cache) which needs roughly twice as much energy (16-20W). There are at least four of them per chip, and a high-end chip can have eight. So in total the Intel CPU plus memory buffers have a 201W TDP (165W CPU + 4x9W Jordan Creek 2), while the IBM platform has at best a 270W TDP (190W CPU + 4x20W MBC).
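The platform totals are simply the CPU TDP plus the memory buffers. As a quick Python check (the helper name is ours):

```python
def platform_tdp(cpu_w, buffer_count, buffer_w):
    """CPU TDP plus the combined TDP of its memory buffers, in watts."""
    return cpu_w + buffer_count * buffer_w


intel_w = platform_tdp(cpu_w=165, buffer_count=4, buffer_w=9)   # Jordan Creek 2
ibm_w = platform_tdp(cpu_w=190, buffer_count=4, buffer_w=20)    # Centaur MBC

print(f"Intel: {intel_w} W, IBM: {ibm_w} W, difference: {ibm_w - intel_w} W")
```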



POWER8 Servers: The Reality Check

As we've just seen, the specs of the POWER8 as announced at launch are very impressive. But what about the real world? The top models (10-12 cores at 4 GHz+, 2TB per socket) are still limited to the extremely expensive E870/E880, which typically cost around 3 times as much (or more) as a comparable Xeon E7 system. But there is light at the end of the tunnel: "PowerLinux" quad socket systems are more expensive than comparable x86 systems, but only by 10 to 30%.

The real competition for x86 must probably come from the third parties of the OpenPOWER Foundation. IBM sells them POWER8 chips at much more reasonable prices ($2k - $3k), so it is possible to build a reasonably priced POWER8 system. The POWER8 chips sold to third parties are somewhat "lighter" versions, but that is more of an advantage than you would think. For example, by keeping the clockspeed a bit lower, the power consumption is lower (190W TDP). These chips also have only 4 (instead of 8) memory buffer chips, which "limits" them to 1 TB of memory, but again this saves quite a bit of power, between 50W and 80W. In other words, the POWER8 chips available to third parties are much more reasonable and even more competitive than the power gobbling, ultra expensive behemoths that got all the attention at launch.

Tyan already has a one socket server and several Taiwanese (Wistron) and Chinese vendors are developing 2 socket systems. Quad socket models are not yet on the horizon as far as we know, but that is probably going to change soon.

POWER8 vs. Xeon E5 v3: SPECing It Out

Unfortunately we did not have access to a full blown POWER8 system at this time. But as our loyal readers know, we do not limit our server testing to the x86 world (see here and here). So until a POWER8 system arrives, we'll have to check out the available industry standard benchmarks. To that end we looked up the SPEC CPU2006 numbers for a single socket CPU.

SPEC CPU2006 - One chip

The 12 cores inside the POWER8 - the single socket chips found in the more reasonably priced servers - perform very well. The integer performance is only a few percent lower than the Intel chip and POWER8's floating point performance is well ahead of the Xeon.

Overall the POWER8 is quite capable of keeping up with the Xeon E5-2699v3. And don't let the "2.3 GHz" official clockspeed fool you into thinking that the Xeons are clocked unnecessarily low, either: in SPECint, the Xeon is running at 2.8 GHz most of the time.

Ultimately, the POWER8 is able to offer slightly higher raw performance than the Intel CPUs, however it just won't be able to do so at the same performance/watt. Meanwhile the reasonable pricing of the POWER8 chips should result in third party servers that are strongly competitive with the Xeon on a performance-per-dollar basis. Reasonably priced, well performing dual and quad socket Linux on Power servers should be possible very soon.



Benchmark Configuration

As far as reliability is concerned, while we have little reason to doubt that the quad Xeon OEM systems out there are the pinnacle of reliability, our initial experience with Xeon E7 v3 has not been as rosy. Our updated and upgraded Quad Xeon Brickland system was only finally stable after many firmware updates, with its issues sorted out just a few hours before the launch of the Xeon E7 v3. Unfortunately this means our time testing the stable Xeon E7 v3 was a bit more limited than we would have liked.

Meanwhile, to make the comparison more interesting, we decided to include both the Quad Xeon "Westmere-EX" as well as the "Nehalem-EX". Remember that these heavy duty, high RAS servers continue to be used for much longer in the data center than their dual socket counterparts: 5 years or more is no exception. Of course, the comparison would not be complete without the latest dual Xeon 2699 v3 server.

All testing has been done on 64 bit Ubuntu Linux 14.04 (kernel 3.13.0-51, gcc version 4.8.2).

Intel S4TR1SY3Q "Brickland" IVT-EX 4U-server

The latest and greatest from Intel consists of the following components:

CPU: 4x Xeon E7-8890v3 2.5 GHz (18c, 45 MB L3, 165W TDP), or 4x Xeon E7-4890 v2 (D1 stepping) 2.8GHz (15 cores, 37.5MB L3, 155W TDP)
RAM: 256 GB, 32x 8 GB Micron DDR-4-2100 at 1600MHz, or 256 GB, 32x8GB Samsung 8GB DDR3 M393B1K70DH0-YK0 at 1333MHz
Motherboard: Intel CRB Baseboard "Thunder Ridge"
Chipset: Intel C602J
PSU: 2x1200W (2+0)

Total amount of DIMM slots is 96. When using 64GB LRDIMMs, this server can offer up to 6TB of RAM.

If only two cores are active, the 8890 can boost the clockspeed to 3.3 GHz (AVX code: 3.2 GHz). The 4890v2 reaches 3.4 GHz in that situation. Even with all cores active, 2.9 GHz is possible (AVX code: 2.6 GHz).

Intel Quanta QSSC-S4R Benchmark Configuration

The previous quad Xeon E7 server, as reviewed here.

CPU: 4x Xeon X7560 at 2.26GHz, or 4x Xeon E7-4870 at 2.4GHz
RAM: 16x8GB Samsung 8GB DDR3 M393B1K70DH0-YK0 at 1066MHz
Motherboard: QCI QSSC-S4R 31S4RMB00B0
Chipset: Intel 7500
BIOS version: QSSC-S4R.QCI.01.00.S012,031420111618
PSU: 4x850W Delta DPS-850FB A S3F E62433-004 850W

The server can accept up to 64 32GB Load Reduced DIMMs (LR-DIMMs) or 2TB.

Intel's Xeon E5 Server – "Wildcat Pass" (2U Chassis)

Finally, we have our Xeon E5 v3 server:

CPU: Two Intel Xeon processor E5-2699 v3 (2.3GHz, 18c, 45MB L3, 145W)
RAM: 128GB (8x16GB) Samsung M393A2G40DB0 (RDIMM)
Internal Disks: 2x Intel MLC SSD710 200GB
Motherboard: Intel Server Board Wildcat Pass
Chipset: Intel Wellsburg B0
BIOS version: August the 9th, 2014
PSU: Delta Electronics 750W DPS-750XB A (80+ Platinum)

Every server was outfitted with two 200 GB S3700 SSDs.



SAP S&D Benchmark

The SAP SD (Sales and Distribution, 2-Tier Internet Configuration) benchmark is an interesting benchmark as it is a real-world client-server application. It is one of those rare industry benchmarks that actually means something to the real IT professionals. Even better, the SAP ERP software is a prime example of where these Xeon E7 chips will be used. We looked at SAP's benchmark database for these results.

Most of the results below were obtained on Windows 2008/2012 and MS SQL Server (both 64-bit). Every 2-Tier Sales & Distribution benchmark was performed with SAP's latest ERP 6 Enhancement Package 4. We analyzed the SAP Benchmark in-depth in one of our earlier articles. The profile of the benchmark has remained the same:

  • Very parallel resulting in excellent scaling
  • Low to medium IPC, mostly due to "branchy" code
  • Somewhat limited by memory bandwidth
  • Likes large caches (memory latency)
  • Very sensitive to sync ("cache coherency") latency

Let's see how the quad Xeon compares to the previous Intel generation, the cheaper dual socket systems, and the RISC competition.

SAP Sales & Distribution 2 Tier benchmark

When we said that the competition in the high-end market was heating up, we were not kidding. The dual socket (24-core) S824 beats the dual socket Xeon E5 by a large margin (+35%), despite the latter having 50% more cores (36 vs 24).
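The per-core gap implied by these numbers is larger than the headline figure suggests. A small Python sketch, using only the +35% result and the two core counts quoted above (the function name is ours):

```python
def per_core_ratio(throughput_ratio, cores_a, cores_b):
    """Per-core throughput of system A relative to system B.

    throughput_ratio is A's total throughput divided by B's.
    """
    return throughput_ratio * cores_b / cores_a


# S824 (24 POWER8 cores) vs dual Xeon E5 (36 Haswell cores), S824 is 35% faster.
ratio = per_core_ratio(throughput_ratio=1.35, cores_a=24, cores_b=36)
print(f"Each POWER8 core delivers about {ratio:.2f}x the SAP throughput of a Xeon E5 core")
```

In other words, each POWER8 core with its eight threads does roughly the work of two Haswell cores in this benchmark.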

At IBM's website, this server is priced at $65k, but the actual street prices are around $35k, slightly below what a typical similar quad Xeon costs (around $40k). Of course, IBM should make it easier for small enterprises to get their hardware quickly at a decent price. But this shows that it is not impossible that POWER servers can become an alternative to the typical x86 systems... just not from IBM's webstore. The POWER8 system might be somewhat cheaper to acquire than the HP DL580 Gen9, but that Intel system is still almost 40% faster, so IBM is not an alternative quite yet. Then again, IBM is a lot more competitive than a few years ago. The S824 is not that far behind the Quad Xeon E7 v2, so it is a good thing that the new Xeon E7 offers about 20% better performance than the latter.

So who is at the top of the server food chain?

SAP Sales & Distribution 2 Tier - 8+ Socket systems

They might be power hungry, but the new POWER8 has made the Enterprise line of IBM more competitive than ever. Gone are the days that IBM needed more CPU sockets than Intel to get the top spot. Nevertheless, it should be noted that you can get several 8-socket Xeon systems for the price of one IBM E870 enterprise server.



Memory Subsystem: Bandwidth

Let's set the stage first and perform some meaningful low level benchmarks. First, we measured the memory bandwidth in Linux. The binary was compiled with the Open64 compiler 5.0 (Opencc). It is a multi-threaded, OpenMP based, 64-bit binary. The following compiler switches were used:

-Ofast -mp -ipa

The results are expressed in GB per second. Note that we also tested with gcc 4.8.1 and compiler options

-O3 -fopenmp -static

Results were consistently 20% to 30% lower with gcc, so we feel our choice of Open64 is appropriate. Everybody can reproduce our results (Open64 is freely available) and since the binary is capable of reaching higher speeds, it is easier to spot speed differences.

Stream Triad

We get a 13% increase in bandwidth, which is a result of the faster SMI interface (3.2 GT/s instead of 2.67 GT/s). Intel did a good job here: it is not easy to keep bandwidth scaling with the socket count.
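For context, the link speed itself went up by more than the measured bandwidth. A rough Python comparison of the two, assuming the 13% figure applies to the whole system and that the link rate is the only change:

```python
old_link_gts = 2.67    # Xeon E7 v2 SMI link rate
new_link_gts = 3.2     # Xeon E7 v3 SMI link rate
measured_gain = 0.13   # Stream Triad improvement reported above

link_gain = new_link_gts / old_link_gts - 1.0
realized_fraction = measured_gain / link_gain

print(f"link rate gain: {link_gain:.1%}")
print(f"share of that gain seen in Stream Triad: {realized_fraction:.0%}")
```

So roughly two thirds of the extra link speed shows up as measured bandwidth, which is not unusual for a four socket system where other parts of the memory path also limit throughput.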



Single-threaded Integer Performance: 7-Zip

The profile of a compression algorithm is somewhat similar to many server workloads: it can be hard to extract instruction level parallelism (ILP) and it's sensitive to memory parallelism and latency. The instruction mix is a bit different, but the overall behavior is still comparable. Testing single threaded is also a great way to check how well the turbo boost feature works in a CPU.

And as one more reason to test performance in this manner, the 7-zip source code is available under the GNU LGPL license. That allows us to recompile the source code on every machine with the -O2 optimization with gcc 4.8.2.

We added the 7-zip scores that we could find at the 7-zip benchmark page. But there is more. The numbers on the 7-zip bench page have no software details, so we could not be sure that they would be accurate. So we managed to get a brief session on a POWER8 "for development purposes" server. The hardware specs can be read below:

Yes, we only got access to 1 core (8 threads) and 2 GB of RAM. So real world server benchmarking was out of the question. Nevertheless, it's a start. To that end we tested with gcc 4.9.1 (supports POWER8) and recompiled our source with the "-O2 -mtune=power8" options on Ubuntu Linux 14.10 for POWER.

LZMA Single-Threaded Performance: Compression

Let us first focus on the new Haswell core inside the Xeon E7, which offers a solid 10% improvement. Turbo boost brings the clockspeed of the Haswell core close enough to the Ivy Bridge core (3.3GHz vs 3.4GHz) and the improved core does the rest. Nevertheless, it is clear that we should not expect huge performance increases with a 10% faster core and 20% more cores.

Back to the more exciting stuff: the fight between Intel and IBM, between the Xeon "Haswell" and the POWER8 chip. The Haswell core is a lot more sophisticated: single threaded performance at 3.3 GHz (turbo) is no less than 50% higher than the POWER8 at 3.4 GHz. That means that the Haswell core is a lot more capable when it comes to extracting ILP out of that complex code.

However, when the IBM monster is allowed to use 8 simultaneous threads spread out over one core, something magical happens. Something that we have not seen in a long, long time: the Intel chip is no longer on top. When you use all the available threading resources in one core, the 3.4 GHz chip is a tiny bit (2%) faster than the best Intel Xeon at 3.3 GHz.



7-Zip Decompression

Up next, let's see how the chips compare in decompression. Decompression is an even lower IPC workload, as it is very branch intensive and depends on the latencies of the multiply and shift instructions.

LZMA Single-Threaded Performance: Decompression

The slightly higher clock of "Ivy Bridge EX" is enough to keep up with "Haswell EX".

Meanwhile, once again the Haswell core proves to be a bit more capable. It sustains 20% higher IPC with one thread. But run 8 threads inside the most powerful RISC core ever, and the POWER8 beats the Xeon E7 by a massive margin: it is almost 50% (!) faster. Wow. Don't believe it? See below.

Now, in defense of Intel, decompression has an exotic instruction mix. You should optimize for the common case, not for exotic software. So we were told by the RISC vendors 30 years ago...

Want more POWER8 benchmarks? Unfortunately we'll have to disappoint you. The limited server we tested on was not able to run any of our server workloads as we only had one core and less than 2 GB to work with.



Multi-Threaded Integer Performance

While compression and decompression are not real world benchmarks (at least as far as servers go), more and more servers have to perform these tasks as part of a larger role (e.g. database compression, website optimization). Let's now enable multi-threaded workloads and see if the mix of a slightly better core, a decent turbo boost (up to 2.9 GHz) and slightly more cores (18 vs 15) is enough.

LZMA Performance: compression

We can conclude that the new Xeon E7 has about 35% more integer crunching power.

LZMA Performance: Decompression

Decompression scales better than compression with more cores, but the difference between the new Xeon E7 v3 and the older E7 v2 is not very large (12%).



Linux Kernel Compile

A more real world benchmark to test the integer processing power of our Xeon servers is a Linux kernel compile. Although few people compile their own kernel, compiling other software on servers is a common task and this will give us a good idea of how the CPUs handle a complex build.

To do this we have downloaded the 3.11 kernel from kernel.org. We then compiled the kernel with the "time make -jx" command, where x is the maximum number of threads that the platform is capable of using. To make the graph more readable, the number of seconds in wall time was converted into the number of builds per hour.
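The conversion from wall time to builds per hour is a simple division. A minimal Python sketch, with made-up wall times that are not our measured results:

```python
def builds_per_hour(wall_seconds):
    """Convert the wall time of one kernel build into builds per hour."""
    if wall_seconds <= 0:
        raise ValueError("wall time must be positive")
    return 3600.0 / wall_seconds


# Hypothetical wall times in seconds, for illustration only.
for label, seconds in (("system A", 60.0), ("system B", 90.0)):
    print(f"{label}: {seconds:.0f} s per build = {builds_per_hour(seconds):.0f} builds/hour")
```

Higher is better in this form, which makes the bars easier to compare than raw seconds.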

Kernel Compile

The new Xeon E7 server is no less than 3 times faster than the Xeon servers it is likely to replace in the datacenter (the ones based upon the Xeon X7560). Meanwhile the performance advantage over the previous generation is 22%, which is noticeable, but hardly spectacular. The Quad Xeon E7 is however almost twice as fast as the dual Xeon E5. In other words, RAS features are not the only differentiator with the E5 line; raw speed is one too.



HPC: OpenFoam

Computational Fluid Dynamics is a very important part of the HPC world. Several readers told us that we should look into OpenFoam, and my lab was able to work with the professionals of Actiflow. Actiflow specializes in combining aerodynamics and product design. Calculating aerodynamics involves the use of CFD software, and Actiflow uses OpenFoam to accomplish this. To give you an idea what these skilled engineers can do, they worked with Ferrari to improve the underbody airflow of the Ferrari 599 and increase its downforce.

We were allowed to use one of their test cases as a benchmark, but we are not allowed to discuss the specific solver. All tests were done on OpenFoam 2.2.1 and openmpi-1.6.3.

Many CFD calculations do not scale well on clusters, that is unless you use InfiniBand. InfiniBand switches are quite expensive and even then there are limits to scaling. We do not have an InfiniBand switch in the lab, unfortunately. Although it's not as low latency as InfiniBand, we do have a good 10G Ethernet infrastructure, which performs rather well. So we can compare our newest Xeon server with a basic cluster.

We also found AVX code inside OpenFoam 2.2.1, so we assume that this is one of the cases where AVX improves FP performance.

OpenFoam Benchmark

Unless you recompile and tune your code for AVX2, the new Xeon E7 v3 is hardly faster than the previous one. The reason may be that the new Xeon can sometimes go as low as 2.1 GHz running AVX code due to the immense power load AVX2 workloads can cause, while the older E7 v2 is capable of sustaining 2.8 GHz.



HPC: Watts per Job

Last, but not least, we have a look at power consumption. First we measure idle power consumption.

Idle Power Consumption

We did not expect the E7 v3 to consume more energy at idle than the previous E7, but sure enough it did. Maybe the DDR4 memory buffers (Jordan Creek 2) need more energy than the previous ones?

For load power testing we used the OpenFOAM test and measured at the 95th percentile, which is basically the power consumed when processing the most parallel part.
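A 95th percentile can be computed in several ways. The Python sketch below uses the nearest-rank method, which is our choice for illustration and not necessarily what the measurement software used; the power samples are hypothetical.

```python
def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile (integer pct, 1-100) using the nearest-rank method."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical wall-power readings in watts, one per sampling interval.
power_samples_w = [610, 640, 905, 930, 955, 960, 962, 965, 970, 975,
                   978, 980, 982, 985, 988, 990, 992, 995, 998, 1010]

p95 = percentile_nearest_rank(power_samples_w, 95)
print(f"95th percentile power: {p95} W")
```

Using the 95th percentile instead of the maximum filters out short spikes while still reflecting the heavily loaded phase of the job.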

HPC Power Consumption - 95th percentile

These quad socket systems are made for reliability, and not quite as much for performance-per-watt. The end result is that these quad socket servers need about as much power as your fabric iron. To put this in perspective: the Xeon E5-2699v3 is considered a real power hog among the Xeon E5s. Most of the other dual Xeon E5 servers are in the 390-450W range.

Let us see how much energy we need for each OpenFOAM job.

Total HPC Energy Consumption per Job

The new Xeon E7-8890v3 is a tiny bit more efficient, but the difference is almost negligible.
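Energy per job is average power multiplied by run time. The Python sketch below shows the relationship with hypothetical numbers, assuming a constant average power over the whole run; it is not the exact procedure behind our chart.

```python
def energy_per_job_wh(avg_power_w, job_seconds):
    """Energy consumed by one job in watt-hours, given average power and run time."""
    return avg_power_w * job_seconds / 3600.0


# Hypothetical systems: one draws more power but finishes sooner.
faster = energy_per_job_wh(avg_power_w=1000.0, job_seconds=1800.0)
slower = energy_per_job_wh(avg_power_w=900.0, job_seconds=2000.0)

print(f"faster system: {faster:.0f} Wh per job")
print(f"slower system: {slower:.0f} Wh per job")
```

This is why a slightly faster chip that also draws slightly more power can end up with nearly the same energy per job.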



Intel's Benchmarks

Since time constraints meant that we were not able to run a ton of benchmarks ourselves, it's useful to check out Intel's own benchmarks as well. In our experience Intel's own benchmarking has a good track record for producing accurate numbers and documenting configuration details. Of course, you have to read all the benchmarking information carefully to make sure you understand just what is being tested.

The OLTP and virtualization benchmarks show that the new Xeon E7 v3 is about 25 to 39% faster than the previous Xeon E7 (v2). In some of those benchmarks, the new Xeon had twice as much memory, but it is safe to say that this will make only a small difference. We think it's reasonable to conclude that the Xeon E7 is 25 to 30% faster, which is also what we found in our integer benchmarks.

The increase in legacy FP applications is much lower. For example Cinebench was 14% faster, SPECFP 9% and our own OpenFOAM was about 4% faster. Meanwhile linpack benchmarks are pretty useless to most of the HPC world, so we have more faith in our own benchmarking. Intel's own realistic HPC benchmarking showed at best a 19% increase, which is nothing to write home about.

The exciting part about this new Xeon E7 is that data analytics/mining happens a lot faster on the new Xeon E7 v3. The 72% faster SAS analytics number is not really accurate as part of the speedup was due to using P3700 SSDs instead of the S3700 SSD. Still, Intel claims that replacing the E7 v2 with the v3 is good for a 55-58% speedup.

The most spectacular benchmark is of course SAP HANA. It is not 6x faster as Intel claims, but rather 3.3x (see our comments about TSX). That is still spectacular and the result of excellent software and hardware engineering.

Final Words: Comparing Xeon E7 v3 vs v2

For those of us running scale-up, reasonably priced HPC or database applications, it is hard to get excited about the Xeon E7 v3. The performance increases are small-but-tangible, however at the same time the new Xeon E7 costs a bit more. Meanwhile as far as our (HPC) energy measurements go, there is no tangible increase in performance per watt.

The Xeon E7 in its natural habitat: heavy heatsinks, hotpluggable memory

However, organizations running SAP HANA will welcome the new Xeon E7 with open arms: they get massive speedups for a 0.1% or less budget increase. The rest of the data mining community with expensive software will benefit too, as the new Xeon E7 is at least 50% faster in those applications thanks to TSX.

Ultimately we wonder how the rest of us will fare. Will SAP/SAS speedups also be visible in open source Big Data software such as Hadoop and Elastic Search? Currently we are still struggling to get the full potential out of the 144 threads. Some of these tests run for a few days only to end with a very vague error message: big data benchmarking is hard.

Comparing Xeon E7 v3 and POWER8

Although the POWER8 is still a power gobbling monster, just like its older brother the POWER7, there is no denying that IBM has made enormous progress. Few people will be surprised that IBM's much more expensive enterprise systems beat Intel based offerings in some high-end benchmarks like SAP's. But the fact that 24 POWER8 cores in a relatively reasonably priced IBM POWER8 server can beat 36 Intel Haswell cores by a considerable margin is new.

It is also interesting that our own integer benchmarking shows that the POWER8 core is capable of keeping up with Intel's best core at the same clockspeed (3.3-3.4 GHz). Well, at least as long as you feed it enough threads in IPC unfriendly code. But that last sentence is the exact description of many server workloads. It also means that the SAP benchmark is not an exception: the IBM POWER8 is definitely not the best CPU to run Crysis (not enough threads) but it is without a doubt a dangerous competitor for Xeon E7 when given enough threads to fill up the CPU.

Right now the threat to Intel is not dire: IBM still asks way too much for its best POWER8 systems and the Xeons have a much better performance-per-watt ratio. But once the OpenPOWER Foundation partners start offering server solutions, there is a good chance that Intel will receive some very significant performance-per-dollar competition in the server market.
