Name: Intel Haswell-EP Xeon 14 Core Review: E5-2695 V3 and E5-2697 V3
Author: Dr. Ian Cutress

Original Link: https://www.anandtech.com/show/8730/intel-haswellep-xeon-14-core-review-e52695-v3-and-e52697-v3

by Ian Cutress on November 20, 2014 10:00 AM EST

Moving up the Xeon product stack, the larger and more complicated the die, the lower the yield. Intel sells its 14-18 core Xeons from a top end design that weighs in at over five billion transistors, and we have had two of the 14C models in for review: the E5-2695 V3 (2.3 GHz, 3.3 GHz turbo) and E5-2697 V3 (2.6 GHz, 3.6 GHz turbo).

The Information

It can take only one failed transistor to break a whole CPU. If the failure happens in a core, as part of the logic or caches, that core can be fused off and the die can be sold as a lower core count part. This is how yields are improved, by reusing the dies that have errors in removable sections. Ultimately this reduces the maximum amount of profit on offer, but it enables CPU manufacturers like Intel and AMD to sell a range of products, rather than just one, from a single design. The way Intel designs its high end E5 V3 Xeons, from an 18-core die, means that its 14 core components either have at least two defects, or are perfectly fine 18 core dies cut down to fill demand.

CPU Specification Comparison
	CPU	Node	Cores	GPU	Transistor Count (Schematic)	Die Size
Server CPUs
Intel	Haswell-EP 14-18C	22nm	14-18	N/A	5.69B	662mm²
Intel	Haswell-EP 10C-12C	22nm	6-12	N/A	3.84B	492mm²
Intel	Haswell-EP 6C-8C	22nm	4-8	N/A	2.6B	354mm²
Intel	Ivy Bridge-EP 12C-15C	22nm	10-15	N/A	4.31B	541mm²
Intel	Ivy Bridge-EP 10C	22nm	6-10	N/A	2.89B	341mm²
Consumer CPUs
Intel	Haswell-E 8C	22nm	8	N/A	2.6B	356mm²
Intel	Haswell GT2 4C	22nm	4	GT2	1.4B	177mm²
Intel	Haswell ULT GT3 2C	22nm	2	GT3	1.3B	181mm²
Intel	Ivy Bridge-E 6C	22nm	6	N/A	1.86B	257mm²
Intel	Ivy Bridge 4C	22nm	4	GT2	1.2B	160mm²
Intel	Sandy Bridge-E 6C	32nm	6	N/A	2.27B	435mm²
Intel	Sandy Bridge 4C	32nm	4	GT2	995M	216mm²
Intel	Lynnfield 4C	45nm	4	N/A	774M	296mm²
AMD	Trinity 4C	32nm	4	7660D	1.303B	246mm²
AMD	Vishera 8C	32nm	8	N/A	1.2B	315mm²
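
As a rough illustration of why larger dies yield worse, the sketch below applies a simple Poisson yield model to the three Haswell-EP die sizes from the table. Both the choice of model and the defect density (`ASSUMED_DEFECTS_PER_CM2`) are assumptions made purely for illustration; Intel does not publish these figures, and production yield models are more involved.

```python
import math

# Die areas in mm^2, taken from the comparison table above.
DIE_AREAS_MM2 = {
    "Haswell-EP 14-18C": 662,
    "Haswell-EP 10C-12C": 492,
    "Haswell-EP 6C-8C": 354,
}

# ASSUMPTION: an illustrative defect density, not an Intel figure.
ASSUMED_DEFECTS_PER_CM2 = 0.1


def poisson_yield(area_mm2, defects_per_cm2=ASSUMED_DEFECTS_PER_CM2):
    """Expected fraction of dies with zero defects under a Poisson model."""
    area_cm2 = area_mm2 / 100.0  # 100 mm^2 per cm^2
    return math.exp(-area_cm2 * defects_per_cm2)


for name, area in DIE_AREAS_MM2.items():
    print(f"{name}: {area} mm^2 -> {poisson_yield(area):.1%} defect-free")
```

Whatever defect density is plugged in, the largest die always comes out with the lowest share of fully working parts, which is why fusing off bad cores and selling the die as a lower core count SKU matters most at the top of the stack.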

I mentioned in the 12 core review that Intel can play fast and loose with their binning process, giving customers almost what they desire in terms of performance and power, as long as they are willing to pay that price. The same could be said for the 14-18 core market, but rather than offer a swathe of units, Intel offers around half a dozen ranging from a 2.0 GHz 14-core to the E5-2699 V3 2.3 GHz 18-core. Intel could release a 65W, 18 core monster, and even though it might come through at 1.2 GHz, this type of SKU is not on the roadmap (unless, perhaps, you meet the high minimum order quantity). If given the opportunity, I would like to see the process by which Intel decides to select which SKUs to bin for retail vs. OEM and custom parts. I suspect it is a combination of part market demand, part yield, part wishful thinking, but I would hope it is at least systematic. Based on the core orientation image below, there might be complications dealing with that final column of six cores, against the other columns of four, either in voltage response characteristics or discrete production errors which might also have another effect.

Our samples today are the E5-2695 V3 at 2.3 GHz base frequency (3.3 GHz turbo) and the E5-2697 V3 at 2.6 GHz (3.6 GHz turbo). In the Xeon naming stack, every number from 2695 to 2699 is taken except 2696, so one might humorously postulate that Intel is merely running out of SKU names, though an added L or W might find its way in if more models joined the list.

In our last test, as well as in previous reviews, the results showed that a 2P system, such as the dual E5-2650L V3s, performed poorly in most of our testing software compared to one big CPU in a single socket. The 1P arrangement tends to outperform a 2P system when the software is not built to take advantage of the NUMA arrangement. Intel does sell CPUs like the E5-1691 V3, a 14 core chip for 1P systems, or we can go straight to the E5-2699 V3 for 18 cores, but there will always be a market for 2P players who need the large memory capacity or who use NUMA-aware software similar to Cinema 4D.

Intel Xeon E5 2600 v3 SKU Comparison
Xeon E5	Cores/ Threads	TDP	Clock Speed (GHz) Base - Turbo	Price
High Performance (35-45MB LLC)
2699 v3	18/36	145W	2.3-3.6	$4115
2698 v3	16/32	135W	2.3-3.6	$3226
2697 v3	14/28	145W	2.6-3.6	$2702
2695 v3	14/28	120W	2.3-3.3	$2424
"Advanced" (20-30MB LLC)
2690 v3	12/24	135W	2.6-3.5	$2090
2685 v3	12/24	120W	2.6-3.5	$2090
2680 v3	12/24	120W	2.5-3.3	$1745
2660 v3	10/20	105W	2.6-3.3	$1445
2658 v3 (E)	12/24	105W	2.2-2.9	$1832
2650 v3	10/20	105W	2.3-3.0	$1167
Midrange (15-25MB LLC)
2640 v3	8/16	90W	2.6-3.4	$939
2630 v3	8/16	85W	2.4-3.2	$667
2620 v3	6/12	85W	2.4-3.2	$422
Frequency optimized (10-20MB LLC)
2687W v3	10/20	160W	3.1-3.5	$2141
2667 v3	8/16	135W	3.2-3.6	$2057
2643 v3	6/12	135W	3.4-3.7	$1552
2637 v3	4/8	135W	3.5-3.7	$996
Budget (15MB LLC)
2609 v3	6/6	85W	1.9	$306
2603 v3	6/6	85W	1.6	$213
Power Optimized (20-30MB LLC)
2650L v3	12/24	65W	1.8-2.5	$1329
2648L v3 (E)	12/24	75W	1.8-2.5	$1544
2630L v3	8/16	55W	1.8-2.9	$612

The big cores get a big power budget and a big price to match. The movement from the 2695 to the 2697 is only 300 MHz, but Intel charges an additional $278 for the privilege, along with a 25W rise in TDP. In terms of frequency response both CPUs follow the same path, with the extra 300 MHz marking the difference in power and price.
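
The step between SKUs can be tabulated directly from the High Performance rows of the table above. The following is a minimal sketch; the `HIGH_PERF` dictionary and the `step` helper are names made up for this example, and the figures are the list values from the table.

```python
# (cores, TDP in W, base GHz, turbo GHz, list price in USD), from the SKU table.
HIGH_PERF = {
    "2699 v3": (18, 145, 2.3, 3.6, 4115),
    "2698 v3": (16, 135, 2.3, 3.6, 3226),
    "2697 v3": (14, 145, 2.6, 3.6, 2702),
    "2695 v3": (14, 120, 2.3, 3.3, 2424),
}


def step(lower, upper):
    """Return what changes when moving from the `lower` SKU to the `upper` SKU."""
    l_cores, l_tdp, l_base, _, l_price = HIGH_PERF[lower]
    u_cores, u_tdp, u_base, _, u_price = HIGH_PERF[upper]
    return {
        "price_usd": u_price - l_price,
        "tdp_w": u_tdp - l_tdp,
        "base_mhz": round((u_base - l_base) * 1000),
        "cores": u_cores - l_cores,
    }


print("2695 -> 2697:", step("2695 v3", "2697 v3"))
print("2695 -> 2698:", step("2695 v3", "2698 v3"))
for sku, (cores, _, _, _, price) in HIGH_PERF.items():
    print(f"{sku}: ${price / cores:.2f} per core")
```

Laying the steps out this way makes it easy to see which jumps buy frequency, which buy cores, and what each costs in list price and TDP.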

If we did some basic 24/365 100% use calculations, using the TDP and $0.10/kWh, the 2697 V3 would consume 1270 kWh and cost $127/yr, compared to the 2695 V3 which would consume 1050 kWh and cost $105/yr. This obviously does not include any additional cooling needed, but with a $22 difference in power per year against $278 in CPU price, it would take roughly 13 years of running for the extra power cost to match the price difference. Clearly the cost per CPU matters more in terms of how much work is going to be done per unit time. If a contract is CPU compute or response bound and takes less time to complete on the faster CPU, that can sway the preference towards it.
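
The arithmetic behind these figures is easy to reproduce. The sketch below uses the same simplifications as the text (TDP treated as constant draw, 100% utilization all year, $0.10/kWh, no cooling overhead); the function and variable names are made up for this example.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours of continuous use
PRICE_PER_KWH = 0.10       # USD per kWh, as assumed in the text


def annual_energy_kwh(tdp_watts):
    """Energy used in a year if the CPU draws its full TDP continuously."""
    return tdp_watts * HOURS_PER_YEAR / 1000.0


def annual_cost_usd(tdp_watts):
    """Yearly electricity cost at PRICE_PER_KWH."""
    return annual_energy_kwh(tdp_watts) * PRICE_PER_KWH


# TDP and list price from the SKU table.
tdp_2697, price_2697 = 145, 2702
tdp_2695, price_2695 = 120, 2424

extra_power_cost = annual_cost_usd(tdp_2697) - annual_cost_usd(tdp_2695)
price_gap = price_2697 - price_2695
years_to_match = price_gap / extra_power_cost

print(f"E5-2697 V3: {annual_energy_kwh(tdp_2697):.0f} kWh/yr, ${annual_cost_usd(tdp_2697):.0f}/yr")
print(f"E5-2695 V3: {annual_energy_kwh(tdp_2695):.0f} kWh/yr, ${annual_cost_usd(tdp_2695):.0f}/yr")
print(f"Extra power cost: ${extra_power_cost:.2f}/yr against a ${price_gap} price gap")
print(f"Years for power cost to match the price gap: {years_to_match:.1f}")
```

With these inputs the power difference takes about 12.7 years to add up to the price gap, so the electricity bill alone is a weak argument either way; time-to-complete is the deciding factor.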

As this is the third in our recent series of Xeon E5-2600 v3 performance coverage, we have covered most of the technical data in our previous two installments regarding 10 core and 12 core performance. We carry over the data from those tests, but refer back to them for details regarding chipset and DRAM information, as well as to Johan’s extensive review, which covers the server-focused aspects of the Xeon E5 v3 design in more depth.

Test Setup

As with the previous reviews, due to the timing available to test each of our CPU samples we were only able to get a limited range of E5-2695 V3 benchmark results. However, we were able to source two E5-2697 V3 CPUs for dual 14-core analysis leading to a 56-thread behemoth.

Test Setup
Processor	Intel Xeon E5-2695 V3 (120W), 14C/28T, 2.3 GHz (3.3 GHz Turbo); Intel Xeon E5-2697 V3 (145W), 14C/28T, 2.6 GHz (3.6 GHz Turbo)
Motherboards	ASUS X99-Deluxe; ASRock X99 Extreme6; GIGABYTE MD60-SC0
Cooling	Cooler Master Nepton 140XL; Dynatron R14
Power Supply	OCZ 1250W Gold ZX Series; Corsair AX1200i Platinum PSU
Memory	ADATA XPG Z1 DDR4-2400 8x8 GB 1.2V; Corsair DDR4-2133 C15 4x8 GB 1.2V; G.Skill Ripjaws 4 DDR4-2133 C15 4x8 GB 1.2V
Memory Settings	JEDEC @ 2133
Video Cards	AMD R7 240 DDR3
Video Drivers	AMD Catalyst 13.11
Hard Drive	OCZ Vertex 3 256GB
Optical Drive	LG GH22NS50
Case	Open Test Bed
Operating System	Windows 7 64-bit SP1

Many thanks to...

We must thank the following companies for kindly providing hardware for our test bed:

Thank you to OCZ for providing us with PSUs and SSDs.
Thank you to G.Skill for providing us with memory.
Thank you to Corsair for providing us with an AX1200i PSU.
Thank you to MSI for providing us with the NVIDIA GTX 770 Lightning GPUs.
Thank you to Rosewill for providing us with PSUs and RK-9100 keyboards.
Thank you to ASRock for providing us with some IO testing kit.
Thank you to Cooler Master for providing us with Nepton 140XL CLCs.
Thank you to GIGABYTE Server for loaning us some CPUs and Dynatron CPU coolers.

Load Delta Power Consumption

Power consumption was tested on the system while in a single MSI GTX 770 Lightning GPU configuration with a wall meter connected to the OCZ 1250W power supply. This power supply is Gold rated, and as I am in the UK on a 230-240 V supply, this leads to ~75% efficiency above 50W and 90%+ efficiency at 250W, suitable for both idle and multi-GPU loading. This method of power reading allows us to compare the power management of the UEFI and the board to supply components with power under load, and includes typical PSU losses due to efficiency.

We take the difference in power between idle and load as our tested value, giving an indication of the power increase from the CPU when placed under stress.

[Chart: Power Consumption Delta: Idle to AVX]

Professional Performance: Windows

Agisoft Photoscan – 2D to 3D Image Manipulation

Agisoft Photoscan creates 3D models from 2D images, a process which is very computationally expensive. The algorithm is split into four distinct phases, and different phases of the model reconstruction require either fast memory, fast IPC, more cores, or even OpenCL compute devices to hand. Agisoft supplied us with a special version of the software to script the process, where we take 50 images of a stately home and convert it into a medium quality model. This benchmark typically takes around 15-20 minutes on a high end PC on the CPU alone, with GPUs reducing the time.

[Chart: Agisoft PhotoScan Benchmark - Total Time]

Cinebench R15

[Chart: Cinebench R15 - Single Threaded]

[Chart: Cinebench R15 - Multi-Threaded]

Professional Performance: Linux

Built around several freely available benchmarks for Linux, Linux-Bench is a project spearheaded by Patrick at ServeTheHome to streamline about a dozen of these tests in a single neat package run via a set of three commands using an Ubuntu 14.04 LiveCD. These tests include fluid dynamics used by NASA, ray-tracing, molecular modeling, and a scalable data structure server for web deployments. We run Linux-Bench and have chosen to report a select few of the tests that rely on CPU and DRAM speed.

C-Ray

C-Ray is a simple ray-tracing program that focuses almost exclusively on processor performance rather than DRAM access. The test in Linux-Bench renders a heavy complex scene offering a large scalable scenario.

[Chart: Linux-Bench c-ray 1.1 (Hard)]

NAMD, Scalable Molecular Dynamics

Developed by the Theoretical and Computational Biophysics Group at the University of Illinois at Urbana-Champaign, NAMD is a set of parallel molecular dynamics codes for extreme parallelization up to and beyond 200,000 cores. The reference paper detailing NAMD has over 4000 citations, and our testing runs a small simulation where the number of calculation steps per unit time is the output metric.

[Chart: Linux-Bench NAMD Molecular Dynamics]

NPB, Fluid Dynamics

Aside from LINPACK, there are many other ways to benchmark supercomputers in terms of how effective they are for various types of mathematical processes. The NAS Parallel Benchmarks (NPB) are a set of small programs originally designed for NASA to test their supercomputers in terms of fluid dynamics simulations, useful for airflow reactions and design.

[Chart: Linux-Bench NPB Fluid Dynamics]

Redis

Many online applications rely on key-value caches and data structure servers to operate. Redis is an open-source, scalable web technology with a strong developer base that also relies heavily on memory bandwidth.

[Chart: Linux-Bench Redis Memory-Key Store, 100x]

[Chart: Linux-Bench Redis Memory-Key Store, 10x]

[Chart: Linux-Bench Redis Memory-Key Store, 1x]

CPU Benchmarks

The dynamics of CPU Turbo modes, both Intel and AMD, can cause concern in environments with a variably threaded workload. There is also the added issue of motherboard consistency, depending on how the motherboard manufacturer wants to add in their own boosting technologies over the ones that Intel would prefer they used. In order to remain consistent, we implement an OS-level unique high performance mode on all the CPUs we test, which should override any motherboard manufacturer performance mode.

HandBrake v0.9.9

For HandBrake, we take two videos (a 2h20 640x266 DVD rip and a 10min double UHD 3840x4320 animation short) and convert them to x264 format in an MP4 container. Results are given in terms of the frames per second processed, and HandBrake uses as many threads as possible.

[Chart: HandBrake v0.9.9 LQ Film]

[Chart: HandBrake v0.9.9 2x4K]

Dolphin Benchmark

Many emulators are often bound by single thread CPU performance, and general reports tended to suggest that Haswell provided a significant boost to emulator performance. This benchmark runs a Wii program that raytraces a complex 3D scene inside the Dolphin Wii emulator. Performance on this benchmark is a good proxy of the speed of Dolphin CPU emulation, which is an intensive single core task using most aspects of a CPU. Results are given in minutes, where the Wii itself scores 17.53 minutes.

[Chart: Dolphin Emulation Benchmark]

WinRAR 5.0.1

[Chart: WinRAR 5.01, 2867 files, 1.52 GB]

PCMark8 v2 OpenCL

A new addition to our CPU testing suite is PCMark8 v2, where we test the Work 2.0 and Creative 3.0 suites in OpenCL mode.

[Chart: PCMark8 v2 Work 2.0 OpenCL with R7 240 DDR3]

Hybrid x265

Hybrid is a new benchmark, where we take a 4K 1500 frame video and convert it into an x265 format without audio. Results are given in frames per second.

[Chart: Hybrid x265, 4K Video]

3D Particle Movement

3DPM is a self-penned benchmark, taking basic 3D movement algorithms used in Brownian Motion simulations and testing them for speed. High floating point performance, MHz and IPC win in the single thread version, whereas the multithread version has to handle the threads and loves more cores.

[Chart: 3D Particle Movement: Single Threaded]

[Chart: 3D Particle Movement: MultiThreaded]

FastStone Image Viewer 4.9

FastStone is the program I use to perform quick or bulk actions on images, such as resizing, adjusting for color and cropping. In our test we take a series of 170 images in various sizes and formats and convert them all into 640x480 .gif files, maintaining the aspect ratio. FastStone does not use multithreading for this test, and results are given in seconds.

[Chart: FastStone Image Viewer 4.9]

Web Benchmarks

On the lower end processors, general usability is a big factor of experience, especially as we move into the HTML5 era of web browsing. For our web benchmarks, we take four well known tests with Chrome 35 as a consistent browser.

Sunspider 1.0.2

[Chart: Sunspider 1.0.2]

Mozilla Kraken 1.1

[Chart: Kraken 1.1]

WebXPRT

[Chart: WebXPRT]

Google Octane v2

[Chart: Google Octane v2]

Gaming Benchmarks

While the last thought on the minds of most Xeon users is related to gaming, we frequently get requests to test gaming performance on Xeons. As a result we strap the Xeon to a regular consumer level motherboard that can support it and add in one or two GPUs to see how they perform and whether more cores make a difference over the drop in frequency. Unfortunately, due to the orientation of the PCIe slots on the 2P board, we were unable to test the dual E5-2697 v3 configuration.

F1 2013

First up is F1 2013 by Codemasters. I am a big Formula 1 fan in my spare time, and nothing makes me happier than carving up the field in a Caterham, waving to the Red Bulls as I drive by (because I play on easy and take shortcuts). F1 2013 uses the EGO Engine, and like other Codemasters games ends up being very playable on old hardware quite easily. In order to beef up the benchmark a bit, we devised the following scenario for the benchmark mode: one lap of Spa-Francorchamps in the heavy wet, the benchmark follows Jenson Button in the McLaren who starts on the grid in 22nd place, with the field made up of 11 Williams cars, 5 Marussia and 5 Caterham in that order. This puts emphasis on the CPU to handle the AI in the wet, and allows for a good amount of overtaking during the automated benchmark. We test at 1920x1080 on Ultra graphical settings.

[Chart: F1 2013 SLI, Average FPS]

Bioshock Infinite

Bioshock Infinite was Zero Punctuation’s Game of the Year for 2013, uses the Unreal Engine 3, and is designed to scale with both cores and graphical prowess. We test the benchmark using the Adrenaline benchmark tool and the Xtreme (1920x1080, Maximum) performance setting, noting down the average frame rates and the minimum frame rates.

[Chart: Bioshock Infinite SLI, Average FPS]

Tomb Raider

The next benchmark in our test is Tomb Raider. Tomb Raider is an AMD optimized game, lauded for its use of TressFX creating dynamic hair to increase the immersion in game. Tomb Raider uses a modified version of the Crystal Engine, and enjoys raw horsepower. We test the benchmark using the Adrenaline benchmark tool and the Xtreme (1920x1080, Maximum) performance setting, noting down the average frame rates and the minimum frame rates.

[Chart: Tomb Raider SLI, Average FPS]

Notice zero results from Tomb Raider from our new CPUs? This benchmark does not seem to like any arrangement above 12 cores per socket, and refuses to run.

Sleeping Dogs

Sleeping Dogs is a benchmarking wet dream – a highly complex benchmark that can bring the toughest setup and high resolutions down into single figures. Having an extreme SSAO setting can do that, but at the right settings Sleeping Dogs is highly playable and enjoyable. We run the basic benchmark program laid out in the Adrenaline benchmark tool, and the Xtreme (1920x1080, Maximum) performance setting, noting down the average frame rates and the minimum frame rates.

[Chart: Sleeping Dogs SLI, Average FPS]

Battlefield 4

The EA/DICE series that has taken countless hours of my life away is back for another iteration, using the Frostbite 3 engine. AMD is also piling its resources into BF4 with the new Mantle API for developers, designed to cut the time required for the CPU to dispatch commands to the graphical sub-system. For our test we use the in-game benchmarking tools and record the frame time for the first ~70 seconds of the Tashgar single player mission, which is an on-rails generation and rendering of objects and textures. We test at 1920x1080 at Ultra settings.

[Chart: Battlefield 4 SLI, Average FPS]

E5-2695 V3 and E5-2697 V3 Conclusion

Reviewing CPUs that differ only in core count or clock speed but are based on similar architectures is predominantly a point and click affair. The CPU with the higher single core frequency does well in response focused benchmarks, and those with more cores do well in multithreaded workloads, provided there is enough memory bandwidth to support them. For Xeons, the price difference for more cores or more frequency never makes much sense if you calculate the difference in terms of power, but becomes more realistic when the time-to-complete for intensive workloads is taken into account. To throw a spanner into the mix, as Johan found in his Haswell-EP testing, Xeons like the E5-2667 v3 exist to take advantage of higher AVX clock speeds due to increased power headroom, accelerating mixed-AVX workloads more than a higher core-count model which has to reduce frequency to compensate. The E5-2667 v3 therefore costs an extra $1000+ over its nearest core counterpart.

The two CPUs we have reviewed today, the E5-2695 V3 and E5-2697 V3, fall directly into Intel’s 2.3-2.6 GHz product line for Xeon E5-2600 V3 CPUs. This comes across as the main segmentation for Intel’s binning process from 8-core to 18-core, with the main differentiator relating to core count and pricing. The difference of 300 MHz comes at the expense of 25W TDP on paper as well as $278 from the back pocket.

An interesting point to note about the 2.3-2.6 GHz stack is the turbo frequencies. Not all SKUs are made similar:

Buying a 2.3 GHz base frequency processor with more cores does several things aside from raising the price: it raises the peak turbo frequency, extends the turbo frequencies across more cores and increases the power consumption. As a result, buying the next CPU up in the stack affords more than just a couple of extra cores, as both single thread and multithread performance are of interest. With this in mind, it might be worth examining the Xeon range through a MHz lens next generation rather than from a core count viewpoint.

On the 2.6 GHz graph, the 8-core model starts with the higher turbo frequency and has a more regular decline, while the 10-core and 12-core are evenly matched until the 12-core reduces to a lower all-core turbo. The peak turbo frequency again lies with the parts with more cores, which in turn carry a higher TDP and cost more to purchase.

I showed the 2.3 GHz graph to some of the other editors and they pointed out the obvious differentiator: the 16 core CPU has a significant MHz advantage from 1-8 core loading over the 14C variant. For any software with a variable threaded load, the 16 core would push the performance and be on par with the 18 core. The price difference between the 14 core model and the 16 core model is about $800 at list price ($2424 against $3226), making budget and workload important factors in this decision. The prevalence of multithreaded code in server and workstation environments makes the frequency difference extremely important, especially for the types of workloads that have frequent memory delays and accesses.

This is apt, as our next element of Xeon coverage will be on the 16-core E5-2698 V3. If we can obtain a sample of the E5-2699 V3 as well, it will complete the set.