Name: The Opteron 6276: a closer look
Item: The Opteron 6276: a closer look
Author: Johan De Gelas

Original Link: https://www.anandtech.com/show/5279/the-opteron-6276-a-closer-look

The Opteron 6276: a closer look

by Johan De Gelas on February 9, 2012 6:00 AM EST

When we first looked at the Opteron 6276, our time was limited and we were only able to run our virtualization, compression, encryption, and rendering benchmarks. Most servers capable of running 20 or more cores/threads target the virtualization market, so that's a logical area to benchmark. The other benchmarks either test a small part of the server workload (compression and encryption) or represent a niche (e.g. rendering), but we included those benchmarks for a simple reason: they gave us additional insight into the performance profile of the Interlagos Opteron, they were easy to run, and last but not least those users/readers that use such applications still benefit.

Back in 2008, however, we discussed the elements of a thorough server review. Our list of important areas to test included ERP, OLTP, OLAP, Web, and Collaborative/E-mail applications. Looking at our initial Interlagos review, several of these are missing in action, but much has changed since 2008. The exploding core counts have made other bottlenecks (memory, I/O) much harder to overcome, the web application that we used back in 2009 stopped scaling beyond 12 cores due to lock contention problems, the Exchange benchmark turned out to be an absolute nightmare to scale beyond 8 threads, and the only manageable OLTP test—Swingbench Calling Circle—needed an increasing number of SSDs to scale.

The ballooning core counts have steadily made it harder and even next to impossible to benchmark applications on native Linux or Windows. Thus, we reacted the same way most companies have reacted: we virtualized our benchmark applications. It's only with a hypervisor that these multi-core monsters make sense in most enterprises, but there are always exceptions. Since quite a few of our readers still like seeing "native" Linux and Windows benchmarks, not to mention quite a few ERP, OLTP, and OLAP servers are still running without any form of virtualization, we took the time to complete our previous review and give the Opteron Interlagos another chance.

Benchmark Configuration

Since AMD sent us a 1U Supermicro server, we had to resort to testing with our 1U servers again. That is why we went back to the ASUS RS700 for the Xeon server.

Supermicro A+ server 1022G-URG (1U Chassis)

CPU	2x AMD Opteron Interlagos 6276 (2.3GHz, 8 cores per CPU, 16 integer clusters); 2x AMD Opteron Interlagos 6220 (3.0GHz, 4 cores per CPU, 8 integer clusters); 2x AMD Opteron Magny-Cours 6174 (2.2GHz, 12 cores per CPU)
RAM	64GB (8x8GB) DDR3-1600 Samsung M393B1K70DH0-CK0
Motherboard	SuperMicro H8DGU-F
Chipset	AMD Chipset SR5670 + SP5100
BIOS version	v2.81 (10/28/2011)
PSU	SuperMicro PWS-704P-1R 750Watt

The AMD CPUs have four memory channels per CPU. The new Interlagos Bulldozer CPU supports DDR3-1600 and thus our dual-CPU configuration uses eight DIMMs for maximum bandwidth and performance. We ran with one DIMM per channel.

Asus RS700-E6/RS4 1U Server

CPU	2x Intel Xeon X5650 (2.66GHz, 6 cores/12 threads)
RAM	48GB (12x4GB) Kingston DDR3-1333 FB372D3D4P13C9ED1
Motherboard	Asus Z8PS-D12-1U
Chipset	Intel 5520
BIOS version	1102 (08/25/2011)
PSU	770W Delta Electronics DPS-770AB

To speed up testing, we ran the Intel Xeon and AMD Opteron systems in parallel. As we didn't have more than eight 8GB DIMMs, we used our 4GB DDR3-1333 DIMMs for the Xeon server. The Xeon system only ends up with 48GB, but this is no disadvantage as our benchmark with the highest memory footprint (Nieuws.be/SQL Server 5 tiles) uses no more than 30GB of RAM.

We measured the difference between 12x4GB and 8x8GB of RAM and recalculated the power consumption for our power measurements (note that the differences were very small). There is no practical alternative as our Xeon has three memory channels and cannot be optimally configured with the same amount of RAM as our Opteron system (which has four channels).

We chose the Xeons based on AMD's positioning. The Xeon E5649 is priced at the same level as the Opteron 6276, but we didn't have the E5649 in the labs. As we suggested in our previous article, the Opteron 6276 should reach the performance of the X5650 to be attractive, so we tested with the X5650.

Common Storage System

Both servers used Intel 710 SSDs for storing the database.

Software configuration

All Windows testing was done on Windows Server 2008 R2 SP1. The Linux tests were done on Ubuntu 11.10 with Linux kernel 3.0.0-14 SMP x86_64.

Other

Both servers were fed by a standard European 230V (16 Amps max.) powerline. The room temperature was monitored and kept at 23°C by our Airwell CRACs. We used the Racktivity ES1008 Energy Switch PDU to measure power. Using a PDU for accurate power measurements might seem pretty insane, but this is not your average PDU. The measurement circuits of most PDUs assume that the incoming AC is a perfect sine wave, but it never is. The Racktivity PDU, however, measures true RMS current and voltage at a very high sample rate: up to 20,000 measurements per second for the complete PDU.
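To illustrate why true RMS matters, here is a minimal Python sketch. It assumes a simple "average-responding" meter model (mean of the rectified signal scaled by the sine form factor), which is a common simplification and not a description of any specific PDU. The waveform, its 30% third harmonic, and all function names are made up for illustration.

```python
import math

def true_rms(samples):
    """Root of the mean of the squared samples (what a true-RMS meter reports)."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def sine_assumed_rms(samples):
    """Mean absolute value scaled by the form factor of a perfect sine wave.

    Assumption: this models a meter that is calibrated for pure sine input.
    """
    form_factor = math.pi / (2 * math.sqrt(2))  # about 1.11 for a pure sine
    return form_factor * sum(abs(s) for s in samples) / len(samples)

def waveform(points, third_harmonic=0.0):
    """One full cycle of a unit sine plus an optional third harmonic."""
    return [
        math.sin(2 * math.pi * k / points)
        + third_harmonic * math.sin(3 * 2 * math.pi * k / points)
        for k in range(points)
    ]

pure = waveform(2000)
distorted = waveform(2000, third_harmonic=0.3)  # hypothetical distortion level

for name, wave in (("pure sine", pure), ("distorted", distorted)):
    print(f"{name}: true RMS {true_rms(wave):.4f}, "
          f"sine-assumed {sine_assumed_rms(wave):.4f}")
```

On the pure sine both methods agree; on the distorted waveform the sine-assumed reading drifts away from the true RMS value, which is the kind of error a true-RMS meter avoids.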

SQL Server 2008 Enterprise R2

We have been using the Flemish/Dutch web 2.0 website Nieuws.be for some time. 99% of the loads on the database are selects and about 5% of them are stored procedures. You can find a more detailed description here.

We have improved our testing methodology and updated the SQL Server version, so the results are no longer comparable with previous results. We used to publish the highest throughput possible, but we have found that this is not entirely fair. At the highest throughput, there was a very small (<1%) percentage of connection errors (client side timeouts), but those timeouts could make the results vary by about 5-10%. A better configured .NET data provider improved this situation. We adapted the .NET data provider to support the same timeout as the MS SQL Server standard timeout (60s), and we now meticulously scan our logs for errors and discard all results that contain any errors.

Since turbo modes are disabled in the "Balanced" Power management policy and we want to evaluate both power and performance, we tested with both the "Balanced" and "High Performance" power management policies.

MS SQL Server 2008

We have reported this before: when it comes to pure OLAP MS SQL Server throughput, the Opteron Magny-Cours is unbeatable. Notice that both the Xeon and the new Opteron 6276 can hardly leverage Turbo (i.e. "High Performance" mode) even though both Intel and AMD talk about the potential to run at higher clockspeeds even when all cores are active. While this integer intensive workload comes nowhere close to consuming what a Linpack run would require, the TDP headroom is no longer there to enable a clockspeed boost.

Looking at the results, the Opteron 6276 disappoints somewhat as it is not capable of outperforming its older brother. However, it performs relatively close to the Xeon and is thus far from a dud.

At maximum CPU load, the response times paint a picture very similar to the throughput numbers:

MS SQL Server 2008

When we look at the response times, the Opteron 6174's leadership is confirmed and emphasized. The fact that a 16-cluster 2.3GHz Opteron offers 30% more throughput and response times that are twice as fast as those of its 8-cluster 3GHz brother is a clear testimony to the excellent scaling capabilities of SQL Server.

Since performance/watt is an extremely important metric, we follow up with a power measurement:

MS SQL Server 2008

There's no doubt about it: the Opteron 6174 is the performance/watt champion in this particular task.

This is the classical way to evaluate server performance, but should you base your purchase on these numbers alone? Frankly, no. The 100% full load evaluation is incomplete, and it's not related to the real world way of using a database. It just shows what your servers can spit out when running at their maximum, a situation most people either try to avoid or never see as so many other bottlenecks (I/O, lock contention) kick in before you see 100% CPU utilization. It is the best method to evaluate HPC machines, but a short-sighted method for almost any server application (web, database, etc.).

In other words, server benchmarks at 100% are just one datapoint, but we should test at lower concurrencies as well. That is why we simulated 40, 80, 100, 125, 200, 300, 400, 600 and 800 users with our vApus stress testing client. Each user starts a query with between 900 and 1100 ms "thinking time" (so on average 1 second). At 600 and 800 users, our servers achieve their maximum throughput, but how do these servers handle "low" and "midrange" workloads? Let us see what happened when we tested with everyday "normal" loads.

MS SQL Server 2008 R2 at Medium Load

First we test at a moderate load (20-40% load) with 125 concurrent users. Note that these numbers are one set of results from our testing a complete chain of concurrencies (25, 40, 80, 100, 125, 200, 300 ... concurrent users). You will see the complete listing of the results later in the review. For this overview, we focus on a specific concurrency. The database is "warmed up" with a test using 25 concurrent users. We always discard the result at 25 concurrent users as you see some disk I/O peaks at the first concurrency.

We don't look at the throughput numbers here: since we only demand about 125 queries per second, all servers deliver somewhere between 117 and 122 queries per second. Instead, we focus on response times.
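The reason throughput barely differs at this load can be sketched with the standard closed-loop relation: each simulated user waits for a response, thinks for about a second, and only then sends the next query, so throughput is roughly users / (think time + response time). The snippet below is a back-of-the-envelope check under that assumption, not a description of how vApus computes its numbers; the response times are the CMT values from the response time table later in this article.

```python
def closed_loop_throughput(users, think_time_s, response_time_s):
    """Queries per second for users that think, fire one query, and wait for it."""
    return users / (think_time_s + response_time_s)

# (concurrent users, average response time in ms) for the Opteron 6276 with CMT
observed = [(125, 46), (200, 59), (300, 92)]

for users, response_ms in observed:
    qps = closed_loop_throughput(users, 1.0, response_ms / 1000.0)
    print(f"{users} users, {response_ms} ms response time: about {qps:.0f} queries/s")
```

As long as response times stay far below the one second of thinking time, the number of users largely dictates the throughput, and the response time is the metric that separates the servers.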

MS SQL Server 2008

Here the response times are very interesting. From the "full load" numbers, you might conclude that the Opteron 6220 is untenable, as it delivers 30-40% lower throughput while consuming just as much power as the other servers. From those same numbers, we would conclude that the Opteron 6174 is the server platform to get. Switch to a moderate load and our conclusions change.

When testing at medium load, we get a much more accurate and nuanced picture as your servers will probably be spending a lot of time running this kind of load. It seems that if you want to save some power (e.g. run the "Balanced" power profile), the Opteron 6220 comes close to the new champion, the Xeon X5650. Since turbo is not enabled in this mode, the 6220 leverages its higher clockspeed to outperform the other Opterons.

Interestingly, the Dynamic Voltage and Frequency Scaling (DVFS) of the Opteron 6174 performs pretty badly compared to the Xeon and the new Opteron. Enabling DVFS increases the response times by 116% (!) on the older Opteron. The Xeon and Opteron 6276 also take a significant (but lower) hit in response time: +66% and +78% respectively.

The Opteron 6220 suffers much less from this problem, as response times only grow 22%. That clearly indicates that the new Opteron deals much better with DVFS. The reason why the 6276 gets such a high penalty in "Balanced" mode is probably due to the fact that it cannot boost to 2.6GHz or 3.2GHz anymore. A better adapted power policy could definitely improve performance at lower loads. We measured the impact of turbo on the power consumption and it was less than 10%. The energy (power * time) increase was even lower (a few %) as the CPU could put the cores to sleep more quickly.

If you think these kinds of response times (<100 ms) don't matter, don't forget that the top 5% of queries can easily show 20-50x higher response times. Those are exactly the queries the users might start to complain about.

Let us look at the power figures.

MS SQL Server 2008

As the Xeon is able to put its cores to sleep more quickly and deeply, the Xeon is a real winner in "Balanced" mode. But notice how the Opteron 6174 performance/watt is no longer attractive: it needs just as much power as Opteron 6276 in balanced mode but delivers worse response times. Meanwhile, the Opteron 6220 fails to impress; it did deliver very decent response times, but it needs 26% more power than the Xeon, which is saving a significant amount of power in "Balanced" mode.

MS SQL Server 2008 R2 at Low Load

Even in this virtualized age, lots and lots of servers are running close to idle quite a bit of the time. We therefore also checked how our servers behaved with 40 concurrent users.

MS SQL Server 2008

The CPUs with the highest single threaded performance have the advantage with the balanced power management mode, but in this situation the power consumed is a lot more important:

MS SQL Server 2008

The Opteron 6174 has no core gating and it shows: the power consumed is about 10 to 15% higher. The Xeon continues to lead in balanced mode, with clearly better response times and a small power advantage. At low load the Opteron 6220 does well, but the best Opteron remains the Opteron 6276. It offers comparable performance/watt to the Xeon in "High Performance" mode: slightly lower power consumption with slightly higher but still very respectable response times.

Threading Tricks or Not?

AMD claimed more than once that Clustered Multi Threading (CMT) is a much more efficient way to crunch through server applications than Simultaneous Multi Threading (SMT), aka Hyper-Threading (HTT). We wanted to check this, so for our next tests we disabled and enabled CMT and HTT. Below you can see how we disabled CMT in the Supermicro BIOS Setup:

First, we look at raw throughput (TP in the table). All measurements were done with the "High Performance" power policy.

Concurrency	CMT	No CMT	TP Increase CMT vs. No CMT	HTT	No HTT	TP Increase HTT vs. No HTT
25	24	24	100%	24	25	100%
40	39	39	100%	39	39	100%
80	77	77	100%	78	78	100%
100	96	96	100%	97	98	100%
125	120	118	101%	122	122	100%
200	189	183	103%	193	192	100%
300	275	252	109%	282	278	102%
350	312	269	116%	321	315	102%
400	344	276	124%	350	339	103%
500	380	281	135%	392	367	107%
600	390	286	136%	402	372	108%
800	389	285	137%	405	379	107%

Only at 300 concurrent users (or queries per second) do the CPUs start to get close to their maximum throughput (around 400 q/s). That is the point where the multi-threading technologies start to pay off.
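The percentage columns in the table above are simply the throughput with the feature enabled divided by the throughput with it disabled. A short Python sketch that recomputes them from the published (rounded) numbers for the higher concurrencies; because the inputs are rounded, the recomputed ratio can be off by a point compared to the table.

```python
# Concurrency: (CMT, No CMT, HTT, No HTT) throughput in queries per second,
# copied from the table above.
throughput = {
    300: (275, 252, 282, 278),
    400: (344, 276, 350, 339),
    600: (390, 286, 402, 372),
    800: (389, 285, 405, 379),
}

def relative_throughput(enabled, disabled):
    """Throughput with the feature enabled, as a percentage of it disabled."""
    return 100.0 * enabled / disabled

for users, (cmt, no_cmt, htt, no_htt) in throughput.items():
    print(f"{users} users: CMT {relative_throughput(cmt, no_cmt):.0f}%, "
          f"HTT {relative_throughput(htt, no_htt):.0f}%")
```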

It is interesting to note that the average IPC of one MS SQL Server thread is about 0.95-1.0 (measured with Intel vTune). That is low enough to have quite a few unused execution slots in the Xeon, which is ideal for Hyper-Threading. However, Hyper-Threading is only capable of delivering a 3-8% performance boost.

On the AMD Opteron we measured an IPC of 0.72-0.8 (measured with AMD CodeAnalyst). That should also be more than low enough to allow two threads to pass through the shared front-end without obstructing each other. While it is not earth shattering, CMT does not disappoint: we measure a very solid 24-37% increase in throughput. Now let's look at the response times (RT in the table).

Concurrency	CMT	No CMT	RT Increase (CMT vs. No CMT)	HTT	No HTT	RT Increase HTT vs. No HTT
25	29	28.5	2%*	20.4	18.9	8%*
40	31.1	32.1	-3% *	21.7	20.3	7%*
80	36	39	-9%*	24	23	2%*
100	39	46	-14%	28	25	13%
125	46	57	-20%	28	28	0%
200	59	92	-35%	38	40	-4%
300	92	189	-51%	62	79	-21%
350	121	303	-60%	91	112	-19%
400	164	452	-64%	143	182	-21%
500	320	788	-59%	278	335	-17%
600	545	1111	-51%	498	621	-20%
800	1003	1825	-45%	989	1120	-12%

* Difference between results is within error margin and thus unreliable.

The SQL Server software engine shows excellent scaling and is ideal for CMT and Hyper-Threading. CMT seems to reduce the response time even at low loads. This is not the case for Hyper-Threading, but we must be careful when interpreting the results. At the lower concurrencies, the response times measured are so small that the differences fall within the error margin. A 21.7 ms response time is indeed 7% more than a 20.3 ms response time, but the error margin of these measurements is much higher at these very low concurrencies than at the higher concurrencies, so take these percentages with a grain of salt.

What we can say is that Hyper-Threading only starts to reduce the response times when the CPU goes beyond 50% load. CMT reduces the response times much more than HTT, but the non-CMT response times are already twice (and more) as high as the non-HTT response times.

In the end, both multi-threading technologies improve performance. CMT seems to be quite a bit more efficient than SMT; however, it must be said that the Xeon with HTT disabled already offers response times that are much lower than the Opteron without CMT. So you could also argue the other way around: the Xeon already does a very good job of filling its pipelines (IPC of 1 versus 0.72), and there is less headroom available.

MS SQL Server 2008 Power Analysis

We'll let power consumption be the final judge:

Concurrency	CMT	No CMT	Power Increase CMT vs. No CMT	HTT	No HTT	Power Increase HTT vs. No HTT
25	144	141	2%	159	156	2%
40	156	160	-2%	169	163	4%
80	185	194	-5%	182	174	4%
100	211	209	1%	192	177	8%
125	223	224	0%	201	186	8%
200	250	254	-2%	235	213	10%
300	290	283	3%	277	251	10%
350	305	288	6%	299	275	9%
400	316	288	10%	303	275	10%
500	316	291	9%	314	275	14%
600	324	299	9%	312	285	9%
800	324	308	5%	320	289	11%

CMT increases the amount of power consumed by 6-10%, but only at high loads. The extra clusters probably allow the modules (as AMD likes to call the cores) to sleep more frequently at lighter loads, and we measure no increase or even a small decrease in power consumption. The message is clear: there is no reason to disable CMT when running MS SQL Server.

Hyper-Threading always seems to increase the power dissipation. At higher concurrencies, the higher performance must be paid for with a 10-14% power increase, so you might consider disabling Hyper-Threading if you want to cap maximum power draw for some reason (e.g. getting too close to the maximum amount of amps allowed in your rack).
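One way to combine the throughput and power tables is to divide queries per second by watts, which gives queries per joule. The sketch below does this for the two highest concurrencies. It assumes that the throughput and power figures come from the same runs, and it uses the rounded table values, so treat the output as a rough indication only.

```python
# (throughput in queries/s, power in W) copied from the tables above
measurements = {
    "CMT":    {600: (390, 324), 800: (389, 324)},
    "No CMT": {600: (286, 299), 800: (285, 308)},
    "HTT":    {600: (402, 312), 800: (405, 320)},
    "No HTT": {600: (372, 285), 800: (379, 289)},
}

def queries_per_joule(queries_per_second, watts):
    """(queries/s) / (joules/s) = queries per joule."""
    return queries_per_second / watts

efficiency = {
    config: {users: queries_per_joule(qps, watts)
             for users, (qps, watts) in runs.items()}
    for config, runs in measurements.items()
}

for config, runs in efficiency.items():
    summary = ", ".join(f"{users} users: {value:.2f} q/J"
                        for users, value in runs.items())
    print(f"{config:7s} {summary}")
```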

MS SQL Server OLAP Conclusion

We invested 10 times more time in our MS SQL Server testing, but frankly we are glad we did. The Opteron 6174 seems to be a true champion from a simple "throughput/power at 100%" analysis, but the reality is that servers hardly ever run at such loads. Under light loads, the Opteron 6174 is either slower and consumes more power (Balanced power setting) or it consumes quite a bit more (High Performance power setting) while being roughly on par with the competition in terms of performance. At medium load, the Opterons are beaten solidly by the Xeon; the Xeon consumes quite a bit less power in "Balanced" and performs a lot better (response times).

At the end of the day, the Xeon X5650 is the better chip (especially in "Balanced" mode) but it's also the more expensive one. The Opteron 6276 price/performance/watt ratio remains quite attractive, but if pricing is taken into account everything will depend on which MS SQL Server License you will get. We will leave that analysis to other people as an economic analysis of complex, customer unfriendly licensing is definitely out of the scope of this article.

MySQL 5.5.17 "Percona Server"

Many readers asked us why we only tested MySQL in a virtualized environment and not on "native" Linux. Indeed, it has been years since we tested MySQL "natively". The reason is simple: MySQL 5.1 and earlier versions scaled pretty badly beyond 4-8 cores, so there was no incentive to run them on modern dual socket servers. However, MySQL 5.5 has been available since December 2010 and it should feature much improved scalability. Even better, the people at Percona released their own version of MySQL and the InnoDB storage engine: Percona Server with XtraDB. This MySQL/InnoDB combination is engineered for even better scalability.

To test this, we installed Percona Server 5.5.17-55 (Release 22.1, November 2011) on top of an Ubuntu 11.10 x86-64 Linux installation with the 3.0.0-14 kernel. This kernel was the latest stable version at the time and is "Bulldozer/Interlagos aware".

We migrated the "Nieuws.be" database to MySQL to have a test similar to our SQL Server test. That migration is not perfect as not all stored procedures were successfully converted, so you should not use the benchmark results below to compare SQL Server and MySQL. However, the profile of the test is the same: it is 99% complex selects that scan large parts of the database. The database is several tens of gigabytes instead of one.

MySQL Sysbench

The results are abysmal for the latest Interlagos Opteron. The best Xeon score is 84% better than the best Opteron score. The results indicate what went wrong: the 8 thread Opteron 6220 at 3GHz scores better than the 16 thread Opteron 6276 at 2.3GHz. A clockspeed advantage of 30% has prevailed over twice as many threads. So we can suspect that the scaling problems are not gone, at least in this test.

Let us take a closer look by performing the same test on a different number of threads and cores. The BIOS of the SuperMicro H8DGU-F allowed us to disable the second integer unit or one or more modules of the new Opterons. (Disabling both at the same time was not possible.) The Asus Z8PS-D12-1U was more flexible: we could disable Hyper-Threading and/or several cores of the Xeon. Here are the scaling results.

MySQL

First, we focus on the results with few cores and threads. Two Bulldozer modules are capable of slightly outperforming four cores of the Opteron Magny-Cours. The ideas behind Bulldozer are sound: two modules are smaller (157 mm²) and more power efficient than four K10 cores (231 mm²). At the same time they perform on par with the Xeon X5650, which is clocked higher, at the same number of threads. At eight threads this is still the case, and the gap between the newer and older Opteron widens in favor of the former.

Beyond eight threads, the new Opteron starts to scale badly. Doubling the number of modules to eight delivers a very small 5% performance advantage. Double the number of modules again and you end up with negative scaling. To make matters worse, the Xeon doesn't have this problem. From eight to 16 threads we get a 76% performance boost. The end result is that a quad-core Xeon beats the best Opteron by a large margin. Let us investigate the matter further.

MySQL OLAP Analyzed

Since threads waiting for mutexes (semaphores) were killing scaling on the old MySQL/InnoDB, we wrote a bash script to monitor the right lines in the long and complex listing of the "SHOW ENGINE INNODB STATUS \G" command. The show status commands are rather hard to interpret as they simply reveal the current status of the counters and show few ratios, so we sampled the output of the status command every second to measure the number of spin waits per second. This gave us some very interesting data points.
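The sampling approach can be sketched as follows. This is not the bash script used for the article but a minimal Python illustration of the same idea: read the cumulative counters from two status dumps and turn the difference into a per-second rate. The line format in the regular expression is an assumption based on the SEMAPHORES section of MySQL 5.5 output and may differ in other versions; the two sample lines contain made-up numbers.

```python
import re

# Assumed format of the relevant line in SHOW ENGINE INNODB STATUS (MySQL 5.5).
MUTEX_LINE = re.compile(r"Mutex spin waits (\d+), rounds (\d+), OS waits (\d+)")

def parse_mutex_counters(status_text):
    """Return the cumulative mutex counters found in one status dump."""
    match = MUTEX_LINE.search(status_text)
    if match is None:
        raise ValueError("no 'Mutex spin waits' line found")
    spin_waits, rounds, os_waits = (int(group) for group in match.groups())
    return {"spin_waits": spin_waits, "rounds": rounds, "os_waits": os_waits}

def per_second(previous, current, interval_s=1.0):
    """Turn two cumulative samples into per-second rates."""
    return {key: (current[key] - previous[key]) / interval_s for key in current}

# Two hypothetical samples taken one second apart (illustrative numbers only).
sample_t0 = "Mutex spin waits 1000000, rounds 4000000, OS waits 50000"
sample_t1 = "Mutex spin waits 1250000, rounds 5200000, OS waits 61000"

rates = per_second(parse_mutex_counters(sample_t0),
                   parse_mutex_counters(sample_t1))
print(rates)
```

In a real setup the two strings would come from running the status command once per second against the server under test.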

Mutex Spin Waits per second

A few thousand spin waits aren't anything to worry about, but a spin wait inherently wastes a small number of CPU cycles, and a few hundred thousand of them will waste a lot of CPU cycles without doing any useful computational work. At 200 concurrent users, five times (!) more spin waits are happening on the Opteron server than on the Xeon server. Since the Opteron has a slightly higher clockspeed than the Xeon (3GHz vs. 2.66GHz), chances are high that the Opteron's micro-architecture is a lot less efficient in handling the mutex. We will try to profile this and report back in our next article. But this already explains a lot: as the core count goes up, more threads are launched to take advantage of this, but more threads means that locking contention plays an increasingly important role.

A CPU that is slow at handling mutexes and locking in general can get an even heavier performance penalty in MySQL, due to a rapidly increasing number of context switches. If a thread spins too many times (too many "rounds") while waiting, it is put to sleep by the OS and placed in a wait array. The context switch associated with this operation allows a new thread to run on the CPU but costs tens of thousands of CPU cycles. So a high number of spin wait rounds and context switches wastes a lot of CPU cycles and power. The end result is that spin locks and high context switch rates are having a devastating effect on the Opteron's power consumption. We integrated the Racktivity energy monitor into our vApus stress testing client. The Opteron is the brown line, the Xeon the blue line.
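The spin-then-sleep behavior described above can be illustrated with a small Python sketch. It is only a conceptual model and not InnoDB's implementation: the function name, the limit of 30 rounds, and the timeout are hypothetical choices. The thread first polls the lock without blocking (burning CPU cycles) and only falls back to a blocking wait, where the OS can deschedule it, once the spin budget is used up.

```python
import threading

def spin_then_block(lock, max_spin_rounds=30, timeout_s=5.0):
    """Poll the lock for a limited number of rounds, then block on it.

    Returns (rounds_spun, had_to_block).
    """
    for rounds in range(1, max_spin_rounds + 1):
        if lock.acquire(blocking=False):  # cheap attempt, no sleeping
            return rounds, False
    # Spin budget exhausted: let the OS put this thread to sleep until
    # the lock is released (this is the expensive context-switch path).
    if not lock.acquire(timeout=timeout_s):
        raise TimeoutError("lock was not released in time")
    return max_spin_rounds, True

lock = threading.Lock()

# Uncontended case: the first attempt succeeds, no sleeping needed.
uncontended = spin_then_block(lock)
lock.release()

# Contended case: the lock is held and only released after 50 ms,
# so the spin rounds run out and the caller has to block.
lock.acquire()
threading.Timer(0.05, lock.release).start()
contended = spin_then_block(lock)
lock.release()

print(uncontended, contended)
```

The more often the second path is taken, the more cycles go to spinning and context switching instead of useful work.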

The dips are the periods between two concurrencies. The first bulge is 100 concurrent users, the second 200, and the third 300. As you can see, the Opteron is running constantly at full throttle while the Xeon only spikes from time to time. The huge number of spin locks is keeping the Opteron cores working hard, while the Xeon cores can take lots of breaks. The Opteron is running almost constantly at 311W (at 200 and 300 concurrent users) while the Xeon runs at 190W with spikes up to 270W. The delta between the surfaces below the lines is the difference in energy consumed, which is huge.

Another way to make the issue clear is to look at the spin rounds per wait. This value shows the number of spin lock rounds per OS wait for a mutex.

Mutex Spin Rounds per Wait

Notice that as the core count goes up and the single threaded performance goes down, the number of spin lock rounds goes up. There is more to it than just "single threaded" performance as we can assume that an Opteron 6174 core at 2.2GHz is not faster than an Opteron 6220 core at 3GHz. In fact at low thread counts, even the Opteron 6276 edged out the Opteron 6174. So as long as the impact of locking contention is not too high, the Opteron 6200 does fine. Once locking contention is determining the performance, it is clear that even the old Magny-Cours architecture handles mutexes a lot better.

The Opteron is clearly not the only one to blame, as MS SQL Server apparently has fewer internal contention problems than the MySQL/InnoDB combination. We might also be able to tweak InnoDB to lower the impact by changing the innodb_sync_spin_loops variable and reducing the number of threads running. However, handling semaphores quickly is important in many if not most server workloads. As long as we don't find a way around it, the Opteron 6200 is not an option for any heavily used MySQL OLAP server.

MySQL OLTP

Currently we don't have a good transactional (OLTP) benchmark that works with our vApus stress test client, so we went back to the MySQL Sysbench utility. Sysbench allows us to place an OLTP load on a MySQL test database, and you can choose the regular test or the read only test. We chose read only as even with several SSDs our benchmark remained disk I/O limited. Our current Promise JBOD does not work with SATA SSDs, so we can only use the three remaining SATA interfaces in our Supermicro server.

The read only setting makes the test less real world, but a Sysbench test is rather synthetic anyway. The main reason why we tested with Sysbench is to get a huge amount of queries that only select very small parts (a few or one row) of the tables, so we can see how our platforms behave in this kind of scenario. And the results were very different from our OLAP benchmarks.

Since we could not use the capabilities of our vApus client, we were not able to perform an in-depth analysis like we did on the MS SQL Server tests. Yes, Sysbench allows you to test with any number of threads you like, but there is no "think time" feature. That means all queries fire off as quickly as possible, so you cannot simulate "light" and "medium" loads.

The response times are very small, which is typical for an OLTP test. To take them into account, we are showing you the highest throughput at around 3 ms (2.8 ms to 3.3 ms). We tested with 1 million records, but 10 million records gave very similar results.

MySQL Sysbench Read Only

The Intel X5650 gets a 30% boost from SMT, which is more or less equal to adding two extra cores (compare Xeon X5650, which is a hex-core, and the E5640 quad-core). This shows that this benchmark scales well over more cores, threads, or clusters.

The second integer cluster inside the new Opteron offers 40% more performance. So once again, CMT does the job. The Opteron 6276 does well but does not really break away from the pack. For example, if we take the small clockspeed advantage of the 6276 into account, the new Opteron is hardly faster than its predecessor.

How about power? We didn't test all configurations but the Xeon X5650, Opteron 6174 and Opteron 6176 are in the same league. The huge increase in TPC-C performance that AMD touts is for a significant part the result of using better SSDs, but we estimate that the new chip is about 20-30% faster than the previous one. The new Opteron appears very capable when it comes to OLTP.

SAP S&D Benchmark

The SAP SD (sales and distribution, 2-tier internet configuration) benchmark is an interesting benchmark as it is a real world client-server application, contrary to many server benchmarks (such as SpecJBB, SpecIntRate, etc.). We looked at SAP's benchmark database for these results. The results below all run on Windows 2008 and an MS SQL Server 2008 database (both 64-bit).

Every 2-tier Sales & Distribution benchmark was performed with SAP's latest ERP 6 enhancement package 4. These results are NOT comparable with any benchmark performed before 2009. We analyzed the SAP Benchmark in-depth in one of our earlier articles. So far, our profile of the benchmark shows:

Very parallel resulting in excellent scaling
Likes large caches (memory latency)
Very sensitive to sync ("cache coherency") latency
Low IPC
Branch and memory intensive code

We managed to get even better profiling of the benchmark. IPC is as low as 0.5 (!) on the most modern Intel CPU architectures. About 48% of the instructions are loads and stores and 18% are branches. Mispredicted branches make up about one percent of all instructions, so the branch misprediction ratio (1% out of 18%) is slightly higher than 5% on modern Intel cores.
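The "slightly higher than 5%" figure follows if mispredicted branches make up about one percent of all executed instructions, which is the reading assumed in this small worked example. The variable names are ours.

```python
branch_share = 0.18        # fraction of all instructions that are branches
mispredicted_share = 0.01  # assumed: mispredicted branches as a fraction of all instructions

# Share of branches that are mispredicted.
misprediction_ratio = mispredicted_share / branch_share

print(f"Branch misprediction ratio: {misprediction_ratio:.1%}")
```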

The instruction cache in particular is hit hard, and the hit rate is typically a lot lower than in other applications (probably 10% misses or lower). Even the large L3 caches are not capable of satisfying all requests. The SAP SD benchmark needs between 10 and 30GB/s, depending on how aggressive the prefetchers are.

SAP Sales & Distribution 2 Tier benchmark

SAP is one of the benchmarks that scale very well and it shows: the server CPUs with the highest thread count are on top. We remember from older benchmarks that enabling Hyper-Threading (on Nehalem and later) boosts SAP's performance by 35%. As the IPC of a single SAP thread is relatively low (0.5 and lower), the decoding front end of the Bulldozer core should be able to handle this easily. Therefore, the extra integer cluster on the Opteron can really do its magic.

We don't have any Xeon X5650 benchmarks, but a quick calculation tells us that the new Opteron 6276 should be about 20% faster than the X5650. It is also about 18% faster, clock for clock, than the older Opteron 6176. The new Opteron does well here.

Making Sense of the New Interlagos Opteron

This second look at the current Xeon and Opteron platforms added OLAP, ERP, and OLTP power and performance data. Combine this with our first review and the other publicly available benchmark and power data and we should be able to evaluate the new Opteron 6200 more accurately. So in which situations does the Opteron 6200 make sense? We'll start with the perspective of the server buyer.

Positioning the Opteron 6276

First let's look at the pricing. The Opteron 6276 is priced similarly to an E5649, which is clocked 5% lower than the X5650 we tested. If you calculate the price of a Dell R710 with the Xeon E5649 and compare it with a Dell R715 with the Opteron 6276 with similar specs, you end up with more or less the same acquisition cost. However, the E5649 is an 80W TDP part and should thus consume a bit less power. That is why we argued that the Opteron 6276 should at least offer a price/performance bonus and perform like an X5650. The X5650 is roughly $220 more expensive, so you end up with the dual socket Xeon system costing about $440 more. On a fully specced server, that is about a 10% price difference.

The Opteron 6276 offered similar performance to the Xeon in our MySQL OLTP benchmarks. If we take into account the hard to quantify TPC-C benchmarks, the Opteron 6276 offers equal to slightly better OLTP performance. So for midrange OLTP systems, the Opteron 6276 makes sense if the higher core count does not increase your software license. The same is true for low end ERP systems.

When we look at the higher end OLTP and the non low end ERP market, the cost of buying server hardware is lost in the noise. The Westmere-EX with its higher thread count and performance will be the top choice in that case: higher thread count, better RAS, and a higher number of DIMM slots.

AMD also lost the low end OLAP market: the Xeon offers a (far) superior performance/watt ratio on MySQL. In the midrange and high end OLAP market, the software costs of for example SQL Server increase the importance of performance and performance/watt and make server hardware costs a minor issue. The "performance first" OLAP market especially will be dominated by the Xeon, which can offer up to 3.06GHz SKUs without increasing the TDP.

The strong HPC performance and the low price continue to make the Opteron a very attractive platform for HPC applications. While we haven't tested this ourselves, even Intel admits that they are "challenged in that area".

The Xeon E5, aka Sandy Bridge EP

There is little doubt that the Xeon E5 will be a serious threat for the new Opteron. The Xeon E5 offers for example twice the peak AVX throughput. Add to this the fact that the Xeon will get a quad channel DDR3-1600 memory interface and you know that the Opteron's leadership in HPC applications is going to be challenged. Luckily for AMD, the 8-core top models of the Xeon E5 will not be cheap according to leaked price tables. Much will depend on how the 6-core midrange models fare against the Opteron.

The Hardware Enthusiast Point of View

The disappointing results in the non-server applications are easy to explain as the architecture is clearly more targeted at server workloads. However, the server workloads show a very blurry picture as well. Looking at the server performance results of the new Opteron is nothing less than very confusing. It can be very capable in some applications (OLTP, ERP, HPC) but disappointing in others (OLAP, Rendering). The same is true for the performance/watt results. And of course, if you name a new architecture Bulldozer and you target it at the server space, you expect something better than "similar to a midrange Xeon".

It is clear to us that quite a few things are suboptimal in the first implementation of this new AMD architecture. For example, the second integer cluster (CMT) is doing an excellent job. If you make sure the front end is working at full speed, we measured a solid 70 to 90% increase in performance enabling CMT (we will give more detail in our next article). CMT works superbly and always gives better results than SMT... until you end up with heavy locking contention issues. That indicates that something goes wrong in the front end. The software applications that do not scale well could be served well with low core count "Valencia" Opteron 4200s, but at the time of writing, the best AMD could offer was a 3.3GHz 6-core. The architecture is clearly capable of reaching very high clockspeeds, but we saw very little performance increase from Turbo Core.

What we end up with then is more questions. That means it's time for us to do some deep profiling and see if we can get some more answers. Until then, we hope you've enjoyed our second round of Interlagos benchmarking, and as always, comments and feedback on our testing methods are welcome.
