Original Link: https://www.anandtech.com/show/2143
Quad Core Intel Xeon 53xx Clovertown
by Johan De Gelas on December 27, 2006 5:00 AM EST - Posted in IT Computing
Introduction
The Xeon 5160 (a.k.a. Woodcrest) is probably the best server chip Intel has made in the past decade. It delivers high IPC, a high (3GHz) clock speed and a surprisingly low 80W TDP for a dual core design. But Intel's engineers saw that the fastest Woodcrest could score twice for Intel, in a slightly different form... or should we say package. Lower the voltage of a 3GHz Woodcrest from about 1.21V to slightly less than 1.1V and it can only reach 2.33GHz. However, by running at 2.33GHz and 1.1V, Intel was able to cut the TDP in half. Now place two of these low voltage Xeons in one package and you get a 2.33GHz quad core Xeon with an 80W TDP. This is the essence of the CPU that bears the codename "Clovertown".
![](https://images.anandtech.com/reviews/it/2006/clovertown/Woodcrestsm.gif)
![](https://images.anandtech.com/reviews/it/2006/clovertown/Clovertownsm.gif)
The quad core Xeon runs at less than 1.1V
The really weird thing about all this is that Intel sells those two Xeon 5160 chips with lower voltage for the same price as one of the higher clocked chips. A 2.33GHz quad core Xeon DP costs the same as a 3GHz dual core Xeon DP ($851), and a 1.86GHz quad core will cost as much as a 2.66GHz dual core. In other words, Intel is willing to give one chip "for free" just to "ignite the quad core era".
Intel is also trading in (theoretically almost 30%) single threaded performance for more multi-threaded processing power while keeping the power envelope the same. It is not as radical as Sun's T1, but the philosophy behind it is more or less the same. Is it a good bet? Well, being first to market with a quad core x86 product will probably make a big impact. It should be especially interesting for HPC and rendering applications. If you need the power of four cores in a blade server, the 65nm Clovertown or Xeon 53xx is the ideal CPU for the very cramped, "hard-to-cool" housing.
The Xeon 53xx - with the appropriate BIOS update - should have drop-in capability in 5000(x) based chipset boards. The availability of Xeon 53xx generates some very interesting choices for the server buyer. With 8 cores on a dual socket system, can it replace the more expensive quad socket systems such as those based on the Opteron 8 series or the recently launched Xeon 71xx CPUs? Should you go for a high clocked dual core Xeon or a lower clocked quad core Xeon? The "standard" answer is that it depends on the application, but that answer is boring and hardly informative. In this article we will try to give an answer to these questions as they apply to Rendering, OpenSSL, Java, SAP and MySQL applications. Jason and Ross are busy with the SQL Server benchmarks, so you can expect even more benchmarks soon.
Quad Core Choices
True, a "dual dual core" is not really a quad core in the way that AMD's upcoming Barcelona chip is. However, from an economic point of view it makes a lot of sense: not only is there the marketing advantage of being the "first quad core" processor, but - according to Intel - using two dual cores gives you 20% higher die yields and 12% lower manufacturing costs compared to a simulated but similar quad core design. Economic advantages aren't the only perspective, of course, and from a technical point of view there are some drawbacks.
Per core bandwidth is one of them, but frankly it receives too much attention. Only HPC applications really benefit from high bandwidth. Let us give you one example. We tested the Xeon E5345 with two and four channels of FB-DIMMs, so basically we tested the CPU with about 8.5GB/s and 17GB/s of memory bandwidth. The result was that 3dsmax didn't care (1% difference) and that even the memory intensive benchmark SPECjbb2005 showed only an 8% difference. Intel's own benchmarks prove this further: when you increase bandwidth by 25% using a 1333 MHz FSB instead of a 1066 MHz FSB, the TPC score is about 9% higher running on the fastest Clovertown. That performance boost would be a lot less for database applications that do not use 3TB of data. Thus, memory bandwidth is overrated for most applications, and therefore for most IT professionals. Memory latency on the other hand can be a critical factor.
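The bandwidth figures quoted above can be reproduced with simple arithmetic. The sketch below assumes a 64-bit (8 byte) data path per memory channel and per FSB, and DDR2-533 FB-DIMMs as in our test configuration; the function name is ours and purely illustrative.

```python
# Peak theoretical bandwidth = transfers per second * bytes per transfer * channels.
# Assumption: every channel (and the FSB) moves 8 bytes per transfer.
BYTES_PER_TRANSFER = 8


def peak_bandwidth_gbs(mega_transfers_per_s, channels=1):
    """Return the peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    return mega_transfers_per_s * 1e6 * BYTES_PER_TRANSFER * channels / 1e9


two_channels = peak_bandwidth_gbs(533, channels=2)   # roughly 8.5 GB/s
four_channels = peak_bandwidth_gbs(533, channels=4)  # roughly 17 GB/s
fsb_gain = 1333 / 1066 - 1                           # roughly 0.25, i.e. 25%

print(f"2 channels: {two_channels:.1f} GB/s")
print(f"4 channels: {four_channels:.1f} GB/s")
print(f"FSB 1066 -> 1333: +{fsb_gain:.0%}")
```

These are theoretical peaks; the measured differences in 3dsmax and SPECjbb2005 show how little of that headroom most applications use.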
Unfortunately for Intel, latency is exactly the Achilles heel of its current server platform. The numbers below are expressed in clock cycles, except the last column, which is in nanoseconds. All measurements were done with the latency test of CPU-Z.
CPU-Z Memory Latency | ||||||
CPU | L1 | L2 | L3 | min mem | max mem | Absolute latency (ns) |
Dual DC Xeon 5160 3.0 | 3 | 14 | - | 69 | 380 | 127 |
Dual DC Xeon 5060 3.73 | 4 | 30 | - | 200 | 504 | 135 |
Dual Quad Xeon E5345 2.33 | 3 | 14 | - | 80 | 280 | 120 |
Quad DC Xeon 7130M 3.2 | 4 | 29 | 109 | 245 | 624 | 195 |
Quad Opteron 880 2.4 | 3 | 12 | - | 84 | 228 | 95 |
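The absolute latency in the last column follows from the largest memory latency in clock cycles divided by the clock speed in GHz. A minimal sketch of that conversion, using the clock speeds given in the CPU names (the tuple layout is our own choice):

```python
# Convert a latency in clock cycles to nanoseconds: ns = cycles / clock (GHz).
# Each row: (name, clock in GHz, max memory latency in cycles, ns listed in the table)
ROWS = [
    ("Dual DC Xeon 5160", 3.0, 380, 127),
    ("Dual DC Xeon 5060", 3.73, 504, 135),
    ("Dual Quad Xeon E5345", 2.33, 280, 120),
    ("Quad DC Xeon 7130M", 3.2, 624, 195),
    ("Quad Opteron 880", 2.4, 228, 95),
]


def cycles_to_ns(cycles, clock_ghz):
    """One cycle at N GHz lasts 1/N nanoseconds."""
    return cycles / clock_ghz


for name, ghz, cycles, table_ns in ROWS:
    ns = cycles_to_ns(cycles, ghz)
    print(f"{name}: {ns:.0f} ns (table: {table_ns} ns)")
```

This is why the 3.73GHz Xeon ends up close to the 3GHz Xeon 5160 in nanoseconds despite needing far more cycles.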
Anand's memory latency tests measured a 70 ns latency on a Core 2 using DDR2-667 and a desktop chipset, and a 100 ns latency for the same CPU with a server chipset and 667 MHz FB-DIMMs. Let us add about 10 ns latency for using buffered instead of unbuffered DIMMs and using the 5000p chipset instead of Intel's desktop chipset. Based on those assumptions, a theoretical Xeon DP should be able to access 667 MHz DDR2 DIMMs in about 80 ns. So using FB-DIMMs results in 25% more latency (100 ns versus 80 ns) compared to DDR2, which is considerable. Intel is well aware of this as you can see from the slide below.
![](https://images.anandtech.com/reviews/it/2006/clovertown/FBDIMMlatency.jpg)
An older chipset (Lindenhurst) with slower memory (DDR2-400) is capable of offering lower latency than a sparkling new one. FB-DIMM offers a lot of advantages, such as dramatically higher bandwidth per pin and higher capacity. FB-DIMM is a huge step forward from the motherboard designer point of view. However, with 25% higher latency and 3-6 Watt more power consumption per DIMM, it remains to be seen if it is really a step forward for the server buyer.
How about bandwidth? The bandwidth tests couldn't find any real bandwidth advantage for FB-DIMM. The 533 MHz DDR2 chips delivered about 3.7GB/s (via SSE2), and about 2.7GB/s in "normal" conditions (non-SSE compiled). Compared to DDR400 on the Opteron (4.5GB/s max, 3.5GB/s), this is nothing spectacular. Of course, we tested with stream and ScienceMark, and these are single threaded numbers. Right now we don't have the right multithreaded benchmarking tools to really compare the bandwidth of complex NUMA systems such as our HP DL585 or the DIB (Dual Independent Bus) of the Xeon system.
There are much bigger technical challenges than bandwidth. Two Xeon 53xx CPUs have a total of four L2 caches, which all must remain consistent. That results in quite a bit of cache coherency traffic that has to pass over the FSB to the chipset, and from the chipset to the other independent FSB. To avoid making the four cache controllers listen to (snoop) all those messages, Intel implemented a "snoop filter", a sort of cache that keeps track of the coherency state info of all cache lines mapped. The snoop filter tries to prevent unnecessary cache coherency traffic from being sent to the other Independent Bus.
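To make the idea concrete, here is a toy model of a snoop filter for a chipset with two independent buses. It is only a sketch under heavy assumptions: it tracks which bus may hold a line and nothing else, while a real filter also tracks coherency states, writes and invalidations. The class and method names are ours, not Intel's.

```python
class SnoopFilter:
    """Toy snoop filter for a chipset with two independent buses (0 and 1)."""

    def __init__(self):
        # cache line address -> set of bus ids whose caches may hold the line
        self.owners = {}
        self.snoops_sent = 0
        self.snoops_filtered = 0

    def read(self, bus, line):
        """A cache on `bus` misses on `line`; return True if the other bus is snooped."""
        other = 1 - bus
        must_snoop = other in self.owners.get(line, set())
        if must_snoop:
            self.snoops_sent += 1      # the other bus may hold the line: forward the snoop
        else:
            self.snoops_filtered += 1  # nobody on the other bus has it: traffic avoided
        self.owners.setdefault(line, set()).add(bus)
        return must_snoop

    def evict(self, bus, line):
        """Forget that a cache on `bus` holds `line`."""
        self.owners.get(line, set()).discard(bus)


sf = SnoopFilter()
sf.read(0, 0x1000)  # first touch, the snoop is filtered
sf.read(0, 0x2000)  # filtered as well
sf.read(1, 0x1000)  # bus 0 may hold this line, so the snoop is forwarded
print(sf.snoops_sent, sf.snoops_filtered)
```

Without the filter, all three reads would have generated traffic on the other bus; with it, only the one that could actually hit does.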
![](https://images.anandtech.com/reviews/it/2006/clovertown/Snoop5000.jpg)
The impact that cache coherency has on performance is not something only academics discuss; it is a real issue. Intel's successor of the 5000p chipset, codenamed "Seaburg", will feature a more intelligent and larger snoop filter (SF) and is expected to deliver 5% higher performance in bandwidth/FP intensive applications (measured in LS Dyna, Fluent and SpecFP). Seaburg's larger SF will be split up into four sets instead of Blackford's two, which allows it to keep track of each separate L2 cache more efficiently.
To quantify the delay that a "snooping" CPU encounters when it tries to get up-to-date data from another CPU's cache, take a look at the numbers below. We have used Cache2Cache before, and you can find more info here. Cache2Cache measures the propagation time from a store by one processor to a load by the other processor. The results that we publish are approximately twice the propagation time.
Cache2Cache Latency | |||||
Cache coherency ping-pong (ns) | Xeon E5345 | Xeon DP 5160 | Xeon DP 5060 | Xeon 7130 | Opteron 880 |
Same die, same package | 59 | 53 | 201 | 111 | 134 |
Different die, same package | 154 | N/A | N/A | N/A | N/A |
Different die, different socket | 225 | 237 | 265 | 348 | 169-188 |
The Xeons based on the Core architecture (E5345 and 5160) can keep cache coherency latency between the two cores to a minimum thanks to the shared L2 cache. When exchanging cache coherency information from one die to another, the Opteron does have an advantage: exchanging data goes 7 to 25% quicker. Note that the "real cache coherency latency" when running a real world workload is probably quite a bit higher for the Xeons. When the FSB has to transfer a lot of data to the memory, the FSB will be "less available" for cache coherency traffic.
![](https://images.anandtech.com/reviews/it/2006/clovertown/OpteronCC.gif)
The Opteron platform can handle its cache coherency traffic via the HyperTransport links, while most of the data is transferred by the onboard memory controller to the local memory. Unless there is a lot of traffic to remote memory, the Opteron doesn't have to send the cache coherency traffic the same way as the data being processed.
So when we look at our benchmarking numbers, it is good to remember that cache coherency traffic and high latency accesses to the memory might slow our multiprocessing systems down.
Server CPUs Overview
As the CPU is still one of the most important cost factors in a server, we want to give an overview of the currently available server CPUs. We'll start with the Intel processors.
Intel Server CPU Overview | |||||||||
Intel CPU | Clock | Codename | L2 | L3 | FSB | Mem bandwidth | TDP | In test? | Price |
Xeon MP 7140M | 3.4 GHz | Tulsa | 2x 1MB | 16 MB | 200 MHz Quad | 6.4 GB/s | 150 W | no | $1980 |
Xeon MP 7130M | 3.2 GHz | Tulsa | 2x 1MB | 8 MB | 200 MHz Quad | 6.4 GB/s | 150 W | yes | $1391 |
Xeon MP 7120M | 3 GHz | Tulsa | 2x 1MB | 4 MB | 200 MHz Quad | 6.4 GB/s | 95 W | no | $1117 |
. | |||||||||
Xeon MP 7041 | 3 GHz | Paxville | 2x 2MB | - | 200 MHz Quad | 6.4 GB/s | 165 W | no | $3157 |
Xeon MP 7030 | 2.8 GHz | Paxville | 2 x 1MB | - | 200 MHz Quad | 6.4 GB/s | 165 W | no | $1980 |
. | |||||||||
Xeon E5355 | 2.66 GHz | Clovertown | 2x 4 MB | - | 333 MHz Quad | 21 GB/s | 120 W | No | $1172 |
Xeon E5345 | 2.33 GHz | Clovertown | 2x 4 MB | - | 333 MHz Quad | 21 GB/s | 80 W | Yes | $851 |
Xeon E5320 | 1.86 GHz | Clovertown | 2x 4 MB | - | 266 MHz Quad | 17 GB/s | 80 W | No | $690 |
Xeon E5310 | 1.6 GHz | Clovertown | 2x 4 MB | - | 266 MHz Quad | 17 GB/s | 80 W | No | $455 |
. | |||||||||
Xeon DP 5160 | 3 GHz | Woodcrest | 4 MB | - | 333 MHz Quad | 21 GB/s | 80 W | Yes | $851 |
Xeon DP 5150 | 2.66 GHz | Woodcrest | 4 MB | - | 333 MHz Quad | 21 GB/s | 65 W | No | $690 |
Xeon DP 5148 | 2.33 GHz | Woodcrest | 4 MB | - | 333 MHz Quad | 21 GB/s | 40 W | No | $519 |
Xeon DP 5140 | 2.33 GHz | Woodcrest | 4 MB | - | 333 MHz Quad | 21 GB/s | 65 W | No | $455 |
Xeon DP 5130 | 2 GHz | Woodcrest | 4 MB | - | 333 MHz Quad | 21 GB/s | 65 W | No | $316 |
Xeon DP 5120 | 1.86 GHz | Woodcrest | 4 MB | - | 266 MHz Quad | 17 GB/s | 65 W | No | $256 |
. | |||||||||
Xeon DP 5080 | 3.73 GHz | Dempsey | 2x 2MB | - | 266 MHz Quad | 8.5 GB/s | 130 W | Yes | $851 |
Xeon DP 5063 | 3.2 GHz | Dempsey | 2x 2MB | - | 266 MHz Quad | 8.5 GB/s | 95 W | No | $369 |
Xeon DP 5060 | 3.2 GHz | Dempsey | 2x 2MB | - | 266 MHz Quad | 8.5 GB/s | 130 W | No | $316 |
. | |||||||||
3.60 GHz | 3.6 GHz | Irwindale | 2 MB | - | 200 MHz Quad | 6.4 GB/s | 130 W | No | n/a |
The Opteron CPU comes in two forms: one for DDR and one for DDR2. The DDR2 version uses four digit model numbers and the DDR version uses three digits. You can find an overview of the older 940-pin versions with DDR in our review of the Xeon MP.
AMD Server CPU Overview | |||||||||
AMD CPU | Clock | Codename | L2 | L3 | HT | Mem bandwidth | TDP | In test? | Price |
Opteron 8220 SE | 2.8 GHz | Santa Rosa | 2x 1 MB | - | 1000 MHz DDR | 10.6 GB/s | 119 W | No | $2149 |
Opteron 8218 | 2.6 GHz | Santa Rosa | 2x 1 MB | - | 1000 MHz DDR | 10.6 GB/s | 95 W | No | $1514 |
Opteron 8216 | 2.4 GHz | Santa Rosa | 2x 1 MB | - | 1000 MHz DDR | 10.6 GB/s | 95 W | No | $1165 |
Opteron 8214 | 2.2 GHz | Santa Rosa | 2x 1 MB | - | 1000 MHz DDR | 10.6 GB/s | 95 W | No | $873 |
Opteron 8216 HE | 2.4 GHz | Santa Rosa | 2x 1 MB | - | 1000 MHz DDR | 10.6 GB/s | 68 W | No | $1340 |
. | |||||||||
Opteron 2220 SE | 2.8 GHz | Santa Rosa | 2x 1 MB | - | 1000 MHz DDR | 10.6 GB/s | 95 W | No | $786 |
Opteron 2218 | 2.6 GHz | Santa Rosa | 2x 1 MB | - | 1000 MHz DDR | 10.6 GB/s | 95 W | No | $611 |
Opteron 2216 | 2.4 GHz | Santa Rosa | 2x 1 MB | - | 1000 MHz DDR | 10.6 GB/s | 95 W | No | $450 |
Opteron 2214 | 2.2 GHz | Santa Rosa | 2x 1 MB | - | 1000 MHz DDR | 10.6 GB/s | 95 W | No | $377 |
Opteron 2216HE | 2.4 GHz | Santa Rosa | 2x 1 MB | - | 1000 MHz DDR | 10.6 GB/s | 68 W | No | $531 |
There are a few things that you should notice. First of all, the current dual core Opterons that are capable of quad socket operation (8xxx) are very expensive. It will be interesting to see whether or not the Xeon "Clovertown" E5345 can attack the best Opterons as Intel's latest CPU is very aggressively priced. Eight 2.33GHz Xeon cores cost about $1700, whereas eight Opteron cores at 2.4GHz cost no less than $4500. Something else to consider is that right now there are few Xeon 53xx based systems. At the time of this writing, we could only find one HP server with the new Xeon quad core: the HP ProLiant BL20p G4 server, which is a blade server. So apart from the blade server market, the Xeon 53xx is not an immediate threat to the quad socket Opteron systems. However, AMD might have to adapt its prices quickly as Clovertown starts to pop up in rack servers, so we think it is very interesting to compare the quad socket dual core Opterons to the dual socket quad core Xeons. Also note that there is a Xeon E5355 CPU which runs at 2.66GHz, but has a 120W TDP. This CPU will probably find a home in some fast workstations and servers where performance matters the most.
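The price comparison for eight cores is straightforward to reproduce from the list prices in the overview tables. A small sketch (the helper name is ours):

```python
# List prices per CPU in USD, taken from the overview tables.
XEON_E5345_PRICE = 851     # quad core, 2.33GHz
OPTERON_8216_PRICE = 1165  # dual core, 2.4GHz


def cost_for_cores(price_per_cpu, cores_per_cpu, total_cores=8):
    """CPU cost of reaching `total_cores` with identical processors."""
    cpus = total_cores // cores_per_cpu
    return cpus * price_per_cpu


xeon_cost = cost_for_cores(XEON_E5345_PRICE, cores_per_cpu=4)       # two CPUs
opteron_cost = cost_for_cores(OPTERON_8216_PRICE, cores_per_cpu=2)  # four CPUs
print(xeon_cost, opteron_cost)
```

This only counts the processors; the quad socket motherboard and chassis widen the gap further.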
Words of Thanks
A lot of people gave us assistance with this project, and we would of course like to thank them.
Jerry R. Baugh, Intel US
Matty Bakkeren, Intel Netherlands
William H. Lea, Intel US
(www.intel.com)
Damon Muzny, AMD US
(www.amd.com)
Bob Cramblitt and Larry D. Gray
(www.spec.org)
Brecht Kets, MySQL patching and tuning
Pieter Beel, SPECjbb benchmarking
Anja Gheldof, MySQL benchmarking
Tijl Deneut, Linux support
Hardware Configurations
We apologize for not including the new AMD Socket F platform. We are working with several manufacturers and AMD to get Socket F into our Netherlands lab. We'll definitely report back when we have the new AMD platform available. Until then we will test with what we have: Socket 940 Opterons. Here is the list of the different configurations:
Xeon Server 1: Intel "Bensley platform" server
2x Xeon 5160 3GHz or 2x Xeon E5345 at 2.33GHz
Intel Server Board S5000PSL
16GB (8x 2048MB) Micron FB-DIMM Registered DDR2-533 CAS 4, ECC enabled
NIC: Dual Intel PRO/1000 Server NIC
Xeon Server 2: Quad Xeon MP Intel SR6850HW4
Quad Xeon MP 7130M 3.2GHz 8MB L3
Intel 8501 chipset
16GB (8x2048MB) Micron Registered DDR2-400 CAS 3, ECC enabled
NIC: Dual Intel PRO/1000 Server NIC
Opteron Server 1: Quad Opteron HP DL585
Quad Opteron 880 2.4GHz
AMD8000 Chipset
16GB (16x1024MB) Crucial Registered DDR-333 CAS 2.5, ECC enabled
NIC: NC7782 Dual PCI-X Gigabit
Client Configuration: Dual Opteron 850
MSI K8T Master1-FAR
4x512MB Infineon Registered DDR-333, ECC enabled
NIC: Broadcom 5705
Software
Ubuntu 6.06 LTS 64 bit Server Edition (2.6.15-26-amd64-server SMP)
MySQL 5.0.26 with Peter Zaitsev Mutex Patch
SPECjbb2005
Sun Hotspot Java JVM 1.5.0_08
BEA JRockit5.0 P26.4.1JDK 64bit
SPECjbb2005
SPECjbb2005 from SPEC (Standard Performance Evaluation Corporation) evaluates the performance of server side Java by emulating a three-tier client/server system with emphasis on the middle tier. Instead of testing with a possible disk intensive database system, SPECjbb uses tables of objects, implemented by Java Collections, rather than a separate database. The SPECjbb score thus depends on:
- The JVM (Java Virtual Machine) and the way the JVM is tuned
- CPU processing power
- Caching and memory speed
- Multiprocessing configuration (Scalability)
"SPECjbb2005 is a follow-on release to SPECjbb2000, which was inspired by the TPC-C benchmark and loosely follows the TPC-C specification for its schema, input generation, and transaction profile. SPECjbb2005 runs in a single JVM in which threads represent terminals, where each thread independently generates random input before calling transaction specific logic. There is neither network nor disk IO in SPECjbb2005."
SPECjbb starts up to two threads per logical CPU. For example, with Hyper-Threading enabled on our eight core/quad CPU Xeon MP 7130M system, 32 threads were started on the 16 logical CPUs. Each thread is a warehouse. Again from SPEC.org:
"A warehouse is a unit of stored data. It contains roughly 25MB of data stored in many objects in several Collections (HashMaps, TreeMaps). A thread represents an active user posting transaction requests within a warehouse. There is a one-to-one mapping between warehouses and threads, plus a few threads for SPECjbb2005 main and various JVM functions. As the number of warehouses increases during the full benchmark run, so does the number of threads. A "point" represents the throughput during the measurement interval at a given number of warehouses. A full benchmark run consists of a sequence of measurement points with an increasing number of warehouses (and thus an increasing number of threads)"
First we tested with some decent but rather generic tuning that we could use on all systems. The JVM was Sun's, version 1.5.0_08.
java -classpath jbb.jar:check.jar -Xms3072m -Xmx3072m -Xmn1024m -Xss128k -XX:+AggressiveOpts -XX:+UseParallelOldGC -XX:+UseParallelGC spec.jbb.JBBmain -propfile SPECjbb.props
![](https://images.anandtech.com/reviews/it/2006/clovertown/Specjbb1instance.gif)
Our first test is done with only one instance, and you might recall from our Xeon MP coverage that this is a setup that the Opteron does not like. Let us focus mostly on the Intel results. Interestingly, the "Core based" Xeon 5345 cannot outperform the "Pentium 4 based" Xeon MP 7130. The higher clock speed of the Xeon MP (3.2GHz) helps of course, but it is still a surprise, especially considering that the cache system of the Xeon 5345 is quite competitive (4MB low latency L2 per two cores) compared to the Xeon MP (1MB L2 per core, a high latency 8MB L3 per two cores). Clovertown also has the better memory subsystem, especially if you compare the memory latency (120 versus 195 ns).
Next, we also tested SPECjbb with four application instances. Using numactl, a clever utility written by Andi Kleen, we were able to bind each Java application to one CPU node on the HP DL585. We didn't bind instances to CPUs on the Intel platforms (it is possible with taskset) as it gives worse performance.
On the Opteron we used:
numactl --cpubind=(1-4) --membind=(1-4) java -classpath jbb.jar:check.jar -Xms3072m -Xmx3072m -Xmn1024m -Xss128k -XX:+AggressiveOpts -XX:+UseParallelOldGC -XX:+UseParallelGC spec.jbb.JBBmain -propfile SPECjbb.props -id (1-4)
On the Xeons we used:
java -classpath jbb.jar:check.jar -Xms3072m -Xmx3072m -Xmn1024m -Xss128k -XX:+AggressiveOpts -XX:+UseParallelOldGC -XX:+UseParallelGC spec.jbb.JBBmain -propfile SPECjbb.props -id (1 to 4)
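The four command lines differ only in the node binding and the instance id, so they are easy to generate. The sketch below only builds the commands and does not launch anything. It assumes that the NUMA nodes are numbered 0 to 3 (Linux numbers nodes from zero) and uses numactl's long option names `--cpubind` and `--membind`; the helper name is hypothetical.

```python
# JVM arguments shared by every SPECjbb2005 instance (taken from the command lines above).
JAVA_ARGS = [
    "java", "-classpath", "jbb.jar:check.jar",
    "-Xms3072m", "-Xmx3072m", "-Xmn1024m", "-Xss128k",
    "-XX:+AggressiveOpts", "-XX:+UseParallelOldGC", "-XX:+UseParallelGC",
    "spec.jbb.JBBmain", "-propfile", "SPECjbb.props",
]


def instance_command(instance_id, node=None):
    """Build the command for one instance, optionally bound to one NUMA node."""
    cmd = []
    if node is not None:
        # Bind both the threads and the memory allocations to the same node.
        cmd += ["numactl", f"--cpubind={node}", f"--membind={node}"]
    return cmd + JAVA_ARGS + ["-id", str(instance_id)]


# Assumption: nodes 0..3 map to instance ids 1..4.
opteron_cmds = [instance_command(i + 1, node=i) for i in range(4)]
xeon_cmds = [instance_command(i + 1) for i in range(4)]  # no binding on the Xeons

for cmd in opteron_cmds:
    print(" ".join(cmd))
```

Each list could be handed to `subprocess.Popen` to start the instances in parallel.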
![](https://images.anandtech.com/reviews/it/2006/clovertown/Specjbb4instance.gif)
As we have noticed before, the Xeons do not benefit from using more instances, while Opteron performance is boosted significantly. That is quite good news for AMD, as testing with multiple instances is more realistic according to most Java people we talked to. The four dual core 2.4GHz Opterons outperform the 2.33GHz Xeon E5345 by a small margin. This really deserves more attention, as normally the Core based CPUs are capable of outperforming similarly clocked Opterons by a margin of 20% or more. We decided to check out the scaling of the different CPUs by testing with four and eight cores. We also tested the Opteron 880 with DDR-400. Unfortunately, we were not able to test with more than 8GB, so we could only test with two CPUs. The blue numbers are extrapolated numbers. The 2.8GHz Opteron numbers were based on the performance scaling we saw from a 2.2GHz Opteron to a 2.4GHz Opteron.
Specjbb2005 4 instances (Sun Hotspot) Per core performance | |||
CPU | Quad core | Octal core | Scaling 4->8 |
Xeon 7130 3.2 GHz | 39942 | 72980 | 83% |
Xeon 5345 2.33 GHz | 39781 | 67447 | 70% |
Opteron 880 2.4 GHz | 37397 | 71364 | 91% |
Opteron 880 2.4 GHz DDR400 | 41137 | 78500 | 91% |
Opteron 890 2.8 GHz | 46073 | 87920 | 91% |
Xeon 5160 3 GHz | 47743 | N/A | N/A |
Xeon Scaling 2.33 -> 3 GHz | 20% | ||
Opteron 880 vs. Quad core Xeon 2.33 GHz | 3% | 16% | 31% |
Here is a first indication that quad core Xeon does not scale as well as the other systems. Two 2.4GHz Opteron 880 processors are as fast as one Xeon 5345, but four Opterons outperform the dual quad core Xeon by 16%. In other words, the quad Opteron system scales 31% better than the Xeon system.
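The percentages in the table can be recomputed from the raw scores. In the sketch below, the 3% and 16% leads appear to match the DDR-400 Opteron row, which is our reading of the table rather than something stated explicitly.

```python
# (four core score, eight core score) from the SPECjbb2005 table (Sun JVM, 4 instances).
SCORES = {
    "Xeon 7130 3.2 GHz": (39942, 72980),
    "Xeon 5345 2.33 GHz": (39781, 67447),
    "Opteron 880 2.4 GHz": (37397, 71364),
    "Opteron 880 2.4 GHz DDR400": (41137, 78500),
}


def scaling_gain(four, eight):
    """Relative throughput gain when going from four to eight cores."""
    return eight / four - 1


gains = {cpu: scaling_gain(*scores) for cpu, scores in SCORES.items()}
for cpu, gain in gains.items():
    print(f"{cpu}: {gain:.0%}")

xeon4, xeon8 = SCORES["Xeon 5345 2.33 GHz"]
opt4, opt8 = SCORES["Opteron 880 2.4 GHz DDR400"]
lead_at_4 = opt4 / xeon4 - 1  # Opteron lead with four cores
lead_at_8 = opt8 / xeon8 - 1  # Opteron lead with eight cores
# How much larger the Opteron's scaling gain is than the Xeon's.
relative_scaling = gains["Opteron 880 2.4 GHz DDR400"] / gains["Xeon 5345 2.33 GHz"] - 1
print(f"{lead_at_4:.0%} {lead_at_8:.0%} {relative_scaling:.0%}")
```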
When you are in the market for a new server system, you typically care less about performance per core; instead, you care about the performance per dollar. That is why we should also look at performance per socket. If we look at typical HP systems for example, a two socket system with 8GB RAM can be found in the $6000-$7000 price range, a similar quad socket system can cost $11000-14000.
Specjbb2005 4 instances Per socket performance | |
CPU | Dual Socket |
Quad core Xeon 2.33 GHz vs. Xeon 5160 | 41% |
Quad core Xeon 2.33 GHz vs. Opteron 880 | 64% |
Quad core Xeon 2.33 GHz vs. Opteron 890 | 46% |
The Xeon 5345 might scale worse than the Opteron, but it offers a remarkable price/performance ratio. The Opteron 890/8220 costs about the same as the Xeon 5345, but to be fair FB-DIMMs seem to be about 30% more expensive than comparable DDR2 DIMMs. In the case of 8GB of RAM, this might amount to an extra cost of $300, making the Xeon 5345 system more expensive. Still, the Xeon 5345 offers a compelling performance advantage.
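The per socket advantages follow from comparing dual socket scores in the Sun JVM results. A short sketch, assuming the dual Opteron figures are the DDR-400 and extrapolated 2.8GHz values from the per core table:

```python
# SPECjbb2005 (Sun JVM, 4 instances) scores of dual socket configurations.
DUAL_E5345 = 67447  # two quad core Xeons, eight cores in total
DUAL_SOCKET_RIVALS = {
    "Xeon 5160": 47743,
    "Opteron 880 (DDR-400)": 41137,
    "Opteron 890 (extrapolated in the source)": 46073,
}

advantage = {
    cpu: DUAL_E5345 / score - 1 for cpu, score in DUAL_SOCKET_RIVALS.items()
}
for cpu, adv in advantage.items():
    print(f"Dual Xeon E5345 vs. dual {cpu}: +{adv:.0%}")
```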
Specjbb 2005 - Bea JRockit
We suspected that the Sun JVM is reasonably well optimized for the Opteron and that maybe a little less effort went into the Intel optimizations. After all, Sun sells Opteron and Sparc servers. The BEA JRockit JDK provides a highly optimized JVM for running Java applications on x86-64 and Itanium CPUs, so we also did some testing with the BEA JRockit JVM. JRockit is known for being a rather memory gobbling but highly tunable JVM, so we aggressively tuned our server JVM.
On the Xeons we used the following parameters:
/java/jrockit-jdk1.5.0_06/bin/java -cp jbb.jar:check.jar -Xms2048m -Xmx2048m -XXaggressive -XXthroughputcompaction -XXallocprefetch -XXallocRedoPrefetch -XXcompressedRefs -XXlazyUnlocking -XXtlasize128k spec.jbb.JBBmain -propfile SPECjbb.props -id 1-4
On the Opterons we used the following parameters:
numactl --cpubind=0-4 --membind=0-4 /java/jrockit-jdk1.5.0_06/bin/java -classpath jbb.jar:check.jar -XXaggressive -XXcompressedRefs -XXthroughputCompaction -XXlazyUnlocking -XXtlasize=64k -Xms1536m -Xmx1536m spec.jbb.JBBmain -propfile SPECjbb.props -id 1-4
![](https://images.anandtech.com/reviews/it/2006/clovertown/BeaSpecjbb4instance.gif)
As we suspected, Jrockit is better optimized for Intel. A single Xeon 5345 outperforms a dual Opteron 880 by a large margin (26-39%). The victory is significant; however, the Clovertown scaling remains quite mediocre.
Specjbb2005 / Bea Per core performance | |||
CPU | Quad core | Octal core | Scaling 4->8 |
Xeon 7130 3.2 GHz | 50000 | 85909 | 72% |
Xeon 5345 2.33 GHz | 70035 | 103957 | 48% |
Opteron 880 2.4 GHz | 50346 | 92213 | 83% |
Opteron 880 2.4 GHz DDR400 | 55381 | 101434 | 83% |
Xeon 5160 3 GHz | 79154 | N/A | N/A |
Xeon Scaling 2.33 -> 3 GHz | 13% | ||
Opteron 880 vs. Quad core Xeon 2.33 GHz | -28% | -11% | 72% |
Even with DDR-400, a dual Opteron 880 is not able to come close to a single Xeon E5345. However, the picture changes when we look at the "octal core" numbers. A dual Xeon E5345 is only 48% faster than a single one, while the Opteron increases its performance by 83% when the number of cores doubles.
Specjbb2005 / Bea Per socket performance | |
CPU | Dual Socket |
Quad core Xeon 2.33 GHz vs. Xeon 5160 | 41% |
Quad core Xeon 2.33 GHz vs. Opteron 880 | 64% |
Still, the quad core Xeon remains a champion, offering 41% more performance for the same price as its 3GHz dual core brother. If you are using the BEA JVM, the Xeon is a much better choice than the AMD Opteron.
Secure Sockets Layer RSA Performance
Secure web communication is possible through the utilization of the Secure Sockets Layer (SSL). Using "openssl speed rsa" we can measure the number of RSA sign and verify operations that a system can perform per second using OpenSSL 0.9.8a. Both the verifies/s and signs/s benchmarks are rather synthetic, but they give an idea of the "pure" encrypting and decrypting speed.
Note that this time we did not compile OpenSSL with specific flags for each architecture (-march=xxx); instead we used the same flags on each CPU. We feel that this better reflects the real world use of SSL, as most people do not know the specific CPU architecture they are running on. So we compiled with the following on all x86 systems:
gcc -fPIC -DOPENSSL_PIC -DZLIB -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -m64 -DL_ENDIAN -DTERMIO -O3 -Wa,-noexecstack -g -Wall -DMD32_REG_T=int -DMD5_ASM
We also included the T2000 numbers with MAU acceleration via the Solaris Cryptographic Framework from our previous server CPU shootout. One thread of OpenSSL Signing per core is optimal.
![](https://images.anandtech.com/reviews/it/2006/clovertown/CloverOpenSSLs.gif)
In the case of doing verifies, the server has to authenticate the identity of the client. This is a lot less intensive, and we show you the verifies/s numbers at 2048 bits. At 1024 bits length, both the Woodcrest and Opteron were able to verify more than 50,000 keys per core, and that is a hard limit of the OpenSSL benchmark.
![](https://images.anandtech.com/reviews/it/2006/clovertown/CloverOpenSSLv.gif)
Both benchmarks behave as expected. All CPUs scale almost perfectly: this benchmark runs in the caches. The Opteron remains on top, as it offers better OpenSSL performance per clock.
MySQL Configuration
To avoid the scaling problems of MySQL, we compiled version 5.0.26 with Peter Zaitsev's Mutex patch. This patch gives much better scaling and performance using up to four cores. Eight cores and more give variable results. All testing was done with InnoDB as our storage engine in MySQL 5.0.26. Here is our MySQL configuration:
MySQL Configuration | |
default-storage-engine | InnoDB |
skip-external-locking | |
skip-locking | |
key_buffer | 256M |
. | |
table_cache | 64 |
max_allowed_packet | 1M |
thread_stack | 128K |
. | |
sort_buffer_size | 2M |
read_buffer_size | 2M |
innodb_buffer_pool_size | 1G |
. | |
thread_concurrency | 16 |
innodb_thread_concurrency | 16 |
innodb_additional_mem_pool_size | 8MB |
read_rnd_buffer_size | 8MB |
thread_cache | 64 |
max_heap_table | 256MB |
tmp_table | 128MB |
. | |
innodb_log_file_size | 250MB |
innodb_table_locks | 0 |
innodb_flush_log_at_trx_commit | 0 |
max_user_connections | 2000 |
max_connections | 2000 |
The "query cache" was off, as we wanted to test worst case performance. Our test database is still the same ~1GB database. The workload consists of more than 90% selects, mostly a "read intensive" workload.
MySQL results
All numbers are expressed in queries per second (Y-axis), and the X-axis shows the number of concurrent accesses.
![](https://images.anandtech.com/reviews/it/2006/clovertown/Clovermysql.gif)
While the Opteron's performance decreases when we add another 4 cores, a second Xeon E5345 pushes the number of queries/s slightly higher. Clearly, MySQL is not ready for more than four cores right now, and it serves as a great reminder for all those with wild "Tens of cores on one die" dreams: making software scale with massive multi-core systems is not and never will be easy. Below you can see the scaling of MySQL running on one core (a Xeon 5160 with one core disabled), two cores (one CPU) and four cores (dual CPU configuration).
MySQL Core Scaling | |||
Concurrency | 1 core | 2 cores | 4 cores |
5 | 735 | 900 | 1082 |
10 | 826 | 1082 | 1267 |
25 | 823 | 1105 | 1323 |
50 | 780 | 1109 | 1319 |
100 | 689 | 1075 | 1196 |
For those running MySQL, clock speed still rules. One 3GHz Xeon 5160 is already capable of no less than 1000-1100 queries/s. Compare this with the clock speed scaling (1 core):
MySQL Clock Scaling | ||
Concurrency | 2.33 GHz | 3 GHz |
5 | 568 | 735 |
10 | 647 | 826 |
25 | 619 | 823 |
50 | 579 | 780 |
100 | 531 | 689 |
You can see that a roughly 29% higher clock speed (3GHz versus 2.33GHz) results in 28% to 35% higher performance. We can conclude that clock speed still matters, and that it is often much harder to get more performance out of multiple cores, even in applications that are relatively easy to split up into threads.
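The two MySQL tables make the comparison between clock scaling and core scaling easy to quantify. A minimal sketch that derives the gains from the published queries/s numbers (the variable names are ours):

```python
# Queries per second at concurrency 5, 10, 25, 50 and 100, from the two tables above.
CONCURRENCY = [5, 10, 25, 50, 100]
ONE_CORE_2_33GHZ = [568, 647, 619, 579, 531]
ONE_CORE_3GHZ = [735, 826, 823, 780, 689]
TWO_CORES_3GHZ = [900, 1082, 1105, 1109, 1075]
FOUR_CORES_3GHZ = [1082, 1267, 1323, 1319, 1196]


def gain(new, old):
    """Relative improvement of `new` over `old` at each concurrency level."""
    return [n / o - 1 for n, o in zip(new, old)]


clock_gain = gain(ONE_CORE_3GHZ, ONE_CORE_2_33GHZ)     # 2.33GHz -> 3GHz, one core
two_core_gain = gain(TWO_CORES_3GHZ, ONE_CORE_3GHZ)    # one -> two cores
four_core_gain = gain(FOUR_CORES_3GHZ, TWO_CORES_3GHZ) # two -> four cores

for i, c in enumerate(CONCURRENCY):
    print(f"concurrency {c:3d}: clock +{clock_gain[i]:.0%}, "
          f"1->2 cores +{two_core_gain[i]:.0%}, 2->4 cores +{four_core_gain[i]:.0%}")
```

At every concurrency level, raising the clock by 29% helps more than doubling the core count from two to four.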
Although our current DB2 results are "beta" and not ready for publication, we can already say now that DB2 is slower than MySQL but scales much better. We get an 80% increase from 2 to 4 cores.
ERP: SAP Sales & Distribution
Enterprise Resource Planning software is one type of very complex database application. Studies have shown that the performance profile of these applications can be significantly different from that of the underlying database. So we decided to take a look at SAP's benchmark database, to see if we can extract some extra benchmark information to complete our view of the quad core Xeon.
The results below are two tier benchmarks, so the database and the underlying OS can make a big difference. Unless we keep those parameters the same, we cannot compare the results. As the only results for the quad core Xeon which are available have been run on Windows 2003 Enterprise Edition and MS SQL Server 2005 (both 64 bit), we filtered the results to find systems that were run on the same OS and database. With the exception of the Xeon 5160, all systems are equipped with 32GB of RAM. All these benchmarks are done on the SAP "ERP release 2005" two tier Sales & Distribution benchmark.
SAP ERP Release 2005 Windows 2003 EE | ||||||
Number of processors | Number of cores | CPU Type | CPU Speed (MHz) | Response Time (s) | SAPS | Central Server |
4 | 8 | Intel Xeon 7041 | 3000 | 1.97 | 5630 | Hitachi HA8000 Model 270 |
2 | 4 | Intel Xeon 5160 | 3000 | 1.71 | 5020 | Hitachi HA8000 Model 130 |
2 | 8 | Quad-Core Intel Xeon Processor X5355 | 2660 | 1.97 | 8770 | FS PRIMERGY Model TX300 S3 / RX300 S3 |
2 | 8 | Quad-Core Intel Xeon Processor X5355 | 2660 | 1.98 | 8970 | HP ProLiant DL380 G5 |
Unfortunately we have no comparison with an Opteron system. We can solve that by keeping every parameter the same, but now we take a look at the benchmarking that happened on SAP release 2004.
SAP ERP Release 2004 Windows 2003 EE | ||||||
Number of processors | Number of cores | CPU Type | CPU Speed (MHz) | Response Time (s) | SAPS | Central Server |
4 | 8 | Intel XEON 7140M | 3400 | 1.99 | 10650 | HP ProLiant DL580 G4 |
4 | 8 | AMD Opteron processor Model 8220SE | 2800 | 1.97 | 9920 | HP ProLiant DL585 G2 |
4 | 8 | Intel XEON 7140M | 3400 | 1.97 | 9850 | Dell PowerEdge 6850 |
4 | 8 | AMD Opteron processor Model 8218 | 2600 | 1.98 | 9570 | HP ProLiant BL45p G2 |
4 | 8 | AMD Opteron processor Model 885 | 2600 | 1.97 | 8520 | Sun Blade x8400 |
4 | 8 | AMD Opteron processor Model 880 | 2400 | 1.96 | 7520 | FS PRIMERGY Model BX630 |
4 | 8 | AMD Opteron processor Model 875 | 2200 | 1.84 | 7020 | FS PRIMERGY Model BFa40 |
2 | 4 | Intel XEON 5160 | 3000 | 1.98 | 5780 | FS PRIMERGY Model RX200 S3 |
2 | 4 | AMD Opteron processor Model 880 | 2400 | 1.87 | 4400 | FS PRIMERGY Model BX630 |
If you go to SAP's two tier benchmark results page, you will notice that the performance differences between similar systems benchmarked on release 2004 and 2005 are minor. The reason why the difference between the Xeon 5160 servers in our tables is about 15% is that the first benchmark resulted in a 1.71 response time and the second had a response time of 1.98. Given a similar response time we can be pretty sure that the results would be very similar. To summarize, if we keep all parameters the same, the benchmark results of the first table should be comparable to the results of the second table. So, while it is not an exact science, a dual quad core Xeon at 2.66GHz should be about 55% faster than a dual Xeon 5160. It is also a bit slower than a quad socket Opteron 2.6GHz system. SAP scales very well with additional cores; based on our assumptions we may conclude that both the Xeon and the Opteron system improve by about 70% when moving from four to eight cores.
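The percentages quoted above can be re-derived from the two tables. The short Python sketch below does only that arithmetic: the SAPS values are copied from the tables, while the function and variable names are our own illustrative choices. It still rests on the assumption discussed above, namely that release 2004 and release 2005 results are comparable.

```python
def percent_gain(new, old):
    """Return how much larger `new` is than `old`, in percent."""
    return (new / old - 1.0) * 100.0

# SAPS values copied from the two tables above.
XEON_5160_R2005 = 5020       # dual Xeon 5160, release 2005, 1.71 s response time
XEON_5160_R2004 = 5780       # dual Xeon 5160, release 2004, 1.98 s response time
X5355_R2005 = [8770, 8970]   # the two dual X5355 submissions, release 2005
OPTERON_8218_QUAD = 9570     # quad Opteron 8218 (2.6GHz), release 2004
OPTERON_880_DUAL = 4400      # dual Opteron 880, release 2004
OPTERON_880_QUAD = 7520      # quad Opteron 880, release 2004

gap_5160 = percent_gain(XEON_5160_R2004, XEON_5160_R2005)
x5355_vs_5160 = [percent_gain(s, XEON_5160_R2004) for s in X5355_R2005]
x5355_vs_opteron = percent_gain(max(X5355_R2005), OPTERON_8218_QUAD)
opteron_4_to_8 = percent_gain(OPTERON_880_QUAD, OPTERON_880_DUAL)

print(f"Xeon 5160, release 2004 run vs. release 2005 run: {gap_5160:.1f}%")
print("Dual X5355 vs. dual Xeon 5160: "
      + " to ".join(f"{g:.1f}%" for g in x5355_vs_5160))
print(f"Best dual X5355 vs. quad Opteron 8218: {x5355_vs_opteron:.1f}%")
print(f"Opteron 880, four to eight cores: {opteron_4_to_8:.1f}%")
```

The dual X5355 lands between roughly 52% and 55% above the dual Xeon 5160, depending on which of the two submissions is used.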
One other item warrants mention: the Xeon MP "Tulsa" seems to outperform its dual socket sibling by a small margin, and confirms the good integer performance profile that we noticed in SPECjbb2005.
Render Servers
To get a better idea of how the different server platforms compare, we did some rendering too. Most of our tests (MySQL, DB2, and SPECjbb2005) are very integer intensive, whereas render tests are floating point intensive. We start with a simple Cinebench 9.5 benchmark (on Windows 2003 32 bit), which is based on Maxon's Cinema 4D rendering engine.
Cinebench 9.5 | |
CPU | 1280x720 |
Quad Opteron 880 2.4 | 1720 |
Dual Quad Xeon E5345 2.33 | 1686 |
Dual DC Xeon 5160 3.0 | 1456 |
Quad DC Xeon 7130M 3.2 | 1272 |
Quad Xeon E5345 2.33 | 1169 |
Dual Opteron 880 2.4 | 1121 |
Dual DC Xeon 5060 3.73 | 1079 |
Dual DC Xeon 7130M 3.2 | 889 |
Four 2.4GHz Opteron cores are a bit slower than four 2.33GHz Xeons, but when we look at the eight core scores the Opteron is a bit faster. Again, it seems that the Opteron system scales better.
Cinebench 9.5 (32 bit) Per core performance |
|||
CPU | Quad core | Octal core | Scaling 4->8 |
Xeon 7130 3.2 GHz | 889 | 1272 | 43% |
Xeon 5345 2.33 GHz | 1169 | 1686 | 44% |
Opteron 880 2.4 GHz | 1121 | 1720 | 53% |
Opteron 890 2.8 GHz | 1297 | 1990 | 53% |
Xeon 5160 3 GHz | 1456 | N/A | N/A |
. | |||
Xeon Scaling 2.33 -> 3 GHz | 25% | ||
Opteron 880 vs. Quad core Xeon 2.33 GHz | -4% | 2% | 21% |
Why do we analyze this in so much detail? Cinebench, like most renderers, couldn't care less about the memory subsystem. We tested our Clovertown system with two or four memory channels and the results were exactly the same. Therefore, we are pretty sure the slightly worse scaling of the Xeon E5345 is not a result of limited bandwidth or higher latency. There must be something else that limits scalability, and that something else is most likely cache coherency.
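For readers who want to check the scaling figures, here is a minimal Python sketch that recomputes the "Scaling 4->8" column from the Cinebench scores in the scaling table. The dictionary layout and function name are illustrative assumptions of ours; only the scores come from the table.

```python
# Cinebench 9.5 scores (higher is better) from the scaling table above,
# stored as (four core score, eight core score).
SCORES = {
    "Xeon 7130M 3.2GHz": (889, 1272),
    "Xeon E5345 2.33GHz": (1169, 1686),
    "Opteron 880 2.4GHz": (1121, 1720),
}

def scaling_percent(four_cores, eight_cores):
    """Performance gained by doubling the core count, in percent."""
    return (eight_cores / four_cores - 1.0) * 100.0

scaling = {cpu: scaling_percent(*pair) for cpu, pair in SCORES.items()}

for cpu, pct in scaling.items():
    print(f"{cpu}: {pct:.0f}% gain from four to eight cores")

# How much better the Opteron scales relative to the quad core Xeon.
opteron_edge = (scaling["Opteron 880 2.4GHz"]
                / scaling["Xeon E5345 2.33GHz"] - 1.0) * 100.0
print(f"Opteron 880 scaling advantage over Xeon E5345: {opteron_edge:.0f}%")
```

Perfect scaling would be a 100% gain, so both platforms leave a large part of the second set of four cores unused in this test.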
Cinebench 9.5 (32 bit) Per socket performance |
|
CPU | Dual Socket |
Quad core Xeon 2.33 GHz vs. Xeon 5160 | 16% |
Quad core Xeon 2.33 GHz vs. Opteron 880 | 50% |
Quad core Xeon 2.33 GHz vs. Opteron 890 | 30% |
Cinebench is popular because it is an easy benchmark, but 3dsmax is a very popular application. We tested with 3dsmax version 9, which has been improved to work better with multi-core systems. We used the "architecture" scene, which has been our favorite benchmarking scene for years. All tests were done with 3dsmax's default scanline renderer, SSE enabled and we rendered at HD 720p resolution. We measure the time it takes to render frames 20 to 22.
![](https://images.anandtech.com/reviews/it/2006/clovertown/3DSarchitecture.jpg)
3DS Max 9 Architecture (render time, lower is better) | |
CPU | 1280x720 |
Quad Opteron 880 2.4 | 273 |
Dual Quad Xeon E5345 2.33 | 308 |
Dual DC Xeon 5160 3.0 | 309 |
Quad Xeon E5345 2.33 | 392 |
Dual DC Xeon 5060 3.73 | 419 |
Quad DC Xeon 7130M 3.2 | 443 |
Dual Opteron 880 2.4 | 454 |
This cannot be a coincidence anymore: a single Xeon E5345 leaves the dual Opteron 880 far behind, but a dual Xeon E5345 trails the quad Opteron. It is not only the application that matters; the dataset has an impact too. Take a look at the table below, where we rendered at 720p and 480p resolution.
3DS Max 9 Architecture (render time, lower is better) | |||
CPU | 720x480 | 1280x720 | |
Quad Opteron 880 2.4 | 137 | 273 | |
Dual Quad Xeon E5345 2.33 | 138 | 308 | |
Dual DC Xeon 5160 3.0 | 133 | 309 | |
Quad Xeon E5345 2.33 | 167 | 392 | |
Dual DC Xeon 5060 3.73 | 188 | 419 | |
Quad DC Xeon 7130M 3.2 | 201 | 443 | |
Dual Opteron 880 2.4 | 196 | 454 | |
. | |||
Scaling Opteron 880 | 43% | 66% | |
Scaling Xeon E5345 | 21% | 27% |
As you can see, the resolution at which you normally render determines how much you benefit from eight cores. Using an octal core machine to render relatively low resolution movies is like driving a potent 8 cylinder engine in a crowded city: all the horsepower goes to waste as you accelerate for a short period and then hit the brakes when approaching a red light. The same is true for rendering: unless you are rendering a complex scene at high resolution, the multi-core engine can never show its full potential. Thanks to better scaling, the quad Opteron platform still has a small advantage.
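Because 3ds Max reports render times rather than scores, the scaling percentages are computed from the inverse ratio. The Python sketch below reproduces the two scaling rows of the table above; the data structure and function name are our own, and the times are copied from the table.

```python
# Render times for frames 20 to 22 (lower is better), copied from the table:
# resolution -> (four core time, eight core time).
RENDER_TIMES = {
    "Opteron 880 2.4GHz": {"720x480": (196, 137), "1280x720": (454, 273)},
    "Xeon E5345 2.33GHz": {"720x480": (167, 138), "1280x720": (392, 308)},
}

def speedup_percent(time_four, time_eight):
    """Gain from doubling the cores; the ratio is inverted because less time is better."""
    return (time_four / time_eight - 1.0) * 100.0

speedups = {
    cpu: {res: speedup_percent(*times) for res, times in by_res.items()}
    for cpu, by_res in RENDER_TIMES.items()
}

for cpu, by_res in speedups.items():
    for res, pct in by_res.items():
        print(f"{cpu} at {res}: {pct:.0f}% faster with eight cores")
```

Both platforms gain more at the higher resolution, which is the dataset effect described above.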
3DSMax 9 (32 bit) Per socket performance |
|
CPU | Dual Socket |
Quad core Xeon 2.33 GHz vs. Xeon 5160 | 0% |
Quad core Xeon 2.33 GHz vs. Opteron 880 | 47% |
Quad core Xeon 2.33 GHz vs. Opteron 890 | 27% |
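The per socket comparisons for Cinebench and 3ds Max can be derived the same way, as long as the direction of the metric is respected. The following Python sketch is an illustration under that assumption; the helper name and its `higher_is_better` flag are hypothetical. The Opteron 890 rows are left out because no dual socket 3ds Max time for that CPU appears in the tables.

```python
def advantage_percent(candidate, baseline, higher_is_better=True):
    """Advantage of `candidate` over `baseline` in percent.

    For scores the ratio is candidate/baseline; for render times it is flipped.
    """
    if higher_is_better:
        ratio = candidate / baseline
    else:
        ratio = baseline / candidate
    return (ratio - 1.0) * 100.0

# Dual socket results copied from the tables above.
CINEBENCH_SCORE = {"Xeon E5345": 1686, "Xeon 5160": 1456, "Opteron 880": 1121}
MAX_TIME_720P = {"Xeon E5345": 308, "Xeon 5160": 309, "Opteron 880": 454}

results = {}
for rival in ("Xeon 5160", "Opteron 880"):
    results[("Cinebench", rival)] = advantage_percent(
        CINEBENCH_SCORE["Xeon E5345"], CINEBENCH_SCORE[rival])
    results[("3ds Max", rival)] = advantage_percent(
        MAX_TIME_720P["Xeon E5345"], MAX_TIME_720P[rival], higher_is_better=False)

for (benchmark, rival), pct in results.items():
    print(f"{benchmark}: dual Xeon E5345 vs. dual {rival}: {pct:.0f}%")
```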
However, when it comes to price/performance, it is not the quad core Xeon or the Opteron that wins, but most likely the Xeon 5160. It is more flexible, as it will outperform the quad core Xeon in any scene that is less complex than architecture and at resolutions lower than 720p. Only if your scenes use radiosity lighting can we see a clear advantage for using the quad core Xeon. We noticed that the Xeon was up to 40% faster in such scenes.
Analysis
When it comes to power consumption numbers, we could only compare the quad core Xeon with the dual core one. Our Opteron platform was too different to make such comparisons meaningful. In SPECjbb2005, the Clovertown machine consumed about 20 to 40 Watts more, which is about 8 to 15 percent. If you want to compare the Xeon with the Opteron platform, check out Jason and Ross' testing here.
Let us summarize what we have learned so far. Thanks to the very competitive price, the new quad core Xeon is in many applications a winner when it comes to price/performance: a dual socket server is a lot cheaper than a quad socket model and a 2.33GHz quad core Xeon costs the same as a dual core Xeon 5160. Despite the very aggressive price setting and the excellent per socket performance, the newest Xeon is not unbeatable, a result of mediocre scaling.
To put everything in perspective, you can never give enough benchmarking numbers. Let us see what Intel's own benchmarking tells us. First Intel shows off some benchmarks which really demonstrate fantastic scaling moving from one core to eight.
![](https://images.anandtech.com/reviews/it/2006/clovertown/CloverScaling.jpg)
Eight benchmarks that scale excellently make for a very nice chart, but in reality we are only looking at three different benchmarks. Blast, Linpack and matrix multiplies can all be categorized as matrix multiply benchmarks. Scholes and Sungard ACR use different algorithms, but both are products of Sungard and both are used in the financial world. They have a high value to financial analysts who use these tools, and complete our analysis. We found that the quad core Xeon offers about twice the performance of the dual core Opteron per socket, and about 10% more per core (octal core 2.33GHz Xeon versus 2.4GHz octal Opteron). So yes, there are applications out there where a "Clovertown" Xeon is a huge step forward. The question is: how common are these situations? Intel seems to have anticipated this question, and the performance group of Portland presented a very interesting slide.
![](https://images.anandtech.com/reviews/it/2006/clovertown/CloverScaling2.jpg)
Notice that Intel uses a 120W TDP 2.66GHz quad core Xeon and not the 2.33GHz we tested. Considering that everybody - including Intel - agrees that we should go for maximum performance/watt, we would choose the 2.33GHz instead, as it has the same TDP as the Xeon 5160 3GHz which is used as baseline. This means that we have to subtract about 13% from the performance figures if we want to keep the TDP the same, and in that case some of the "compelling gains" are no longer really tangible. So we can conclude that CRM, financial analysis, ERP and Java applications are the best applications for our Clovertown Xeon. For rendering, transaction processing, and especially structural simulation (LS Dyna) and flow modeling (Fluent) the picture is a lot less clear.
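As a rough illustration of that correction, the Python sketch below scales a result measured on the 2.66GHz part down to 2.33GHz. It assumes performance scales linearly with clock speed, which is a simplification of ours, and the slide values in the loop are hypothetical placeholders, not Intel's numbers. The pure clock ratio gives a reduction of a little over 12%, in line with the rough figure used above.

```python
X5355_CLOCK_GHZ = 2.66   # 120W TDP part used in Intel's slide
E5345_CLOCK_GHZ = 2.33   # 80W TDP part tested in this article

# Assumption: performance scales linearly with clock speed.
clock_ratio = E5345_CLOCK_GHZ / X5355_CLOCK_GHZ

def same_tdp_estimate(relative_performance):
    """Scale a result measured at 2.66GHz down to 2.33GHz.

    `relative_performance` is expressed against the Xeon 5160 baseline (1.0).
    """
    return relative_performance * clock_ratio

print(f"Clock reduction: {(1.0 - clock_ratio) * 100.0:.1f}%")

# Hypothetical slide values relative to the Xeon 5160; not Intel's figures.
for claimed in (1.10, 1.30, 1.60):
    adjusted = same_tdp_estimate(claimed)
    print(f"claimed {claimed:.2f}x at 2.66GHz -> about {adjusted:.2f}x at 2.33GHz")
```

Under this assumption, a modest claimed gain at 2.66GHz can fall to or below the Xeon 5160 baseline once the TDP is held constant.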
Conclusion
The aggressive pricing puts the expensive quad socket systems with the Xeon MP and Opteron 8xx(x) under fire. Some customers will still prefer the slightly better RAS features of the latter, but let's be honest: a large part of the market will be quite happy with the more than decent RAS features of the dual socket Intel platform. The S5000PSL for example supports memory sparing and mirroring aside from the obligatory ECC RAM.
The introduction of the new Xeon quad core should still have a big impact, and it is only the beginning. In Q2 2005, we saw the introduction of the dual core Opteron, and less than two years later the number of cores on one socket has doubled again. The increase in multi-core power is outpacing the naturally growing demands of software. The introduction of the new "Barcelona" quad core and Intel's "Tigerton" will make the current high-end systems (8-32 socket) retreat to an ever shrinking market niche.
![](https://images.anandtech.com/reviews/it/2006/clovertown/Clovertulsa.jpg)
The Dual core Xeon MP "Tulsa" looks pretty "fat" compared to the quad core Xeon E5345
To the financial analysts, CRM, ERP and Java server people, the new quad core Xeon E53xx is close to irresistible. You can get four cores for the price of two, or up to eight (!) cores in a relatively cheap dual socket server. We observed at least a 40% performance increase compared to probably the best dual core CPU of today: the Xeon 5160.
For the people looking for a 3D rendering workstation, your usage model will determine whether the Xeon 5160 or the Xeon E5345 is the best solution. You get better animation and 3D manipulation performance (mostly single threaded) and better rendering performance at resolutions lower than High Definition with the Xeon 5160. 3D render servers are better off with the Quad Xeon E53xx but only if they have to render at 720p or full HD (1080p) resolutions.
The past 6 months have been excellent for Intel: after regaining the performance crown in the dual socket server market, there is also now a very viable and low priced alternative for the more expensive quad Opteron based systems. However, it is not all bad news for AMD. The current quad core might be good for Intel's yields, time to market, and production costs, but it does have a weakness. The quad core Xeon scaling is very mediocre, and this despite a high performance chipset. The current 5000P chipset has a large 16MB snoop filter, reads speculatively to decrease memory latency, and has a whole other bag of clever tricks to get more performance out of the platform. Despite all this and a 2x4MB L2 cache setup, the quad core Xeon scales worse than the relatively old quad Opteron platform.
Let us summarize:
AMD Quad Opteron Platform
- Advantages:
- Still the best performing FP platform: highest rendering performance
- Scales better than comparable Intel platform
- Cons:
- Expensive 8xx(x) CPUs and expensive platform (motherboard)
- (Slightly) lower integer performance than E5345
- Lower performance/Watt than Xeon E5345
Intel Quad Xeon MP Platform
- Advantages:
- Better RAS than other platforms
- Good integer performance thanks to huge L3 cache
- Cons:
- Expensive MP CPUs, especially compared to Xeon E5345, and very expensive platform (motherboard, memory boards)
- Pretty bad FP/rendering performance
- Very high latency memory subsystem, L3 cache. (bad HPC performance)
- Bad Performance/Watt, compared to Xeon E53xx and Opteron
Intel Dual Xeon Platform / Clovertown
- Advantages:
- Quad socket performance...
- ...For very low dual socket price in CRM, SAP, financial analysis and Java server
- Excellent rendering performance at high resolutions (>=720p)
- In some cases, a simple upgrade for Xeon 51xx.
- Cons:
- Mediocre scaling in many applications
- Slightly higher power consumption but little or no performance gain compared to Xeon 5160 in flow modeling, 3D rendering (lower resolutions), structural simulation, MySQL and TPC.
A look into the future
Quite some time ago, Pat Gelsinger of Intel showed a CPU that was called "Clovertown MP". Clovertown MP does not exist (anymore) according to all Intel representatives we talked to. So is Tigerton the new Clovertown MP? It does seem to have two dual core dies just like Clovertown and runs at the same maximum clock speed as Clovertown (2.66GHz), so it is very likely that Tigerton is very similar to or even a rebadged Clovertown MP. Another indication is the Clarksboro chipset, which has four DIBs, a gigantic 64MB snoop filter, and other features designed to tackle the scaling problems that we noticed. We are not sure that it will be enough.
It is quite possible, assuming that AMD executes well, that AMD will keep the advantage in the four socket server market with its new Barcelona core in 2007. Its current platform already scales well, and AMD has made a lot of improvements that help scaling. The upcoming Barcelona core has one L3 cache per four cores (less cache coherency traffic), faster and more HT ports, and so on. There are certainly interesting times ahead... But a bird in the hand is worth two in the bush, so until AMD's quad core Opteron actually ships, Intel has the most attractive dual socket platform.