Original Link: https://www.anandtech.com/show/2763



Introduction

While the SPEC Virtualization Committee is slowly but steadily developing a new virtualization benchmark called "SPECvirt_sc2009", VMmark remains the only industry standard benchmark for comparing virtualization hardware, as all the tier-one hardware vendors have published results for it. Why would you, the IT professional, care about yet another benchmark? Some professionals will probably point towards the lack of impact (and interest) that TPC and other industry standard benchmarks have when it comes to hardware purchasing decisions, as "performance is only a small part of the decision". We could not agree more when it comes to many industry benchmarks, but a virtualization benchmark is a special case.

We have said this before: virtualization is the killer application of the first decade of the 21st century. More than half of the servers sold today will end up hosting quite a few virtual machines. Even better, it is a market that is expanding even in these times of economic crisis. Of the 591 IT professionals who took the Data Center Decisions 2008 Purchasing Intentions survey, only 2% indicated that the budget for virtualization related purchases would decrease, while 56% indicated that this budget would increase. That means scoring well in the VMmark benchmark is becoming crucial for the server hardware vendors.

Higher virtualization performance means a higher consolidation ratio, and thus translates in most cases into immediate cost savings: you need fewer servers, which allows you to save on energy, hardware costs, and the manpower to install and maintain those servers. Performance is a much more important factor in the decision process when it comes to virtualizing the datacenter. So while a higher TPC or SPECjbb score will seldom result in happier users and cost savings, a higher VMmark score should almost always do that… in theory.



Understanding the VMmark Score

Before we try to demystify the published VMmark scores, let me state upfront that the VMmark benchmark has its flaws, but we know from firsthand experience how hard it is to build a decent virtualization benchmark. It would be unfair and arrogant to call VMmark a bad benchmark. The benchmark first arrived back in 2006. The people at VMware were pioneers and solved quite a few problems, such as running many applications simultaneously and distilling one score out of many different benchmarks that all report in different units. The benchmark results are consistent and the mix of applications more or less reflects the real world.

Let's refresh your memory: VMware VMmark is a consolidation benchmark. It groups several virtual machines performing different tasks into a "tile". A VMmark tile consists of:

  • MS Exchange VM
  • Java App VM
  • Idle VM
  • Apache web server VM
  • MySQL database VM
  • SAMBA fileserver VM

The first three run on a Windows 2003 guest OS and the last three run on SUSE SLES 10.
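
One of the clever parts of VMmark is how it condenses those dissimilar workloads into a single number. To the best of our understanding, each workload's throughput is normalized against a reference system, the normalized results within a tile are combined with a geometric mean (the idle VM contributes no metric), and the per-tile results are summed. Here is a minimal sketch in Python; the workload names and all numbers are illustrative, not actual VMmark reference data:

    from math import prod

    def tile_score(measured, reference):
        """Geometric mean of per-workload throughput ratios; each workload
        is normalized against the reference system. The idle VM has no
        throughput metric, so it does not appear here."""
        ratios = [measured[w] / reference[w] for w in reference]
        return prod(ratios) ** (1.0 / len(ratios))

    def vmmark_score(tiles, reference):
        """The final score is simply the sum of the individual tile scores."""
        return sum(tile_score(t, reference) for t in tiles)

    # Illustrative throughput numbers only -- not real VMmark data.
    reference = {"mail": 300.0, "java": 20000.0, "web": 1500.0,
                 "database": 400.0, "file": 40.0}
    tiles = [{"mail": 330.0, "java": 26000.0, "web": 1800.0,
              "database": 520.0, "file": 48.0},
             {"mail": 315.0, "java": 24000.0, "web": 1650.0,
              "database": 480.0, "file": 44.0}]
    print(f"VMmark-style score: {vmmark_score(tiles, reference):.2f}")

The geometric mean is what makes results in completely different units (messages per second, SPECjbb BOPS, MB/s) comparable: each workload contributes a unitless ratio, and no single workload can dominate the tile score.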


Now let's list a few of the flaws:

  • The six virtual machines in one tile, applications included, need only 5GB of RAM combined. Most e-mail servers running right now will probably use 4GB or more on their own! The vast majority of MySQL database servers and Java web servers have at least 2GB at their disposal.
  • It uses SPECjbb, which is an "easy to inflate" benchmark. The benchmark scores of SPECjbb are obtained with extremely aggressively tuned JVMs, the kind of tuning you won't find on a typical virtualized, consolidated java web server.
  • SysBench is an oversimplified OLTP test: it performs all of its transactions on a single table.

Regarding our SysBench remark: as good OLTP benchmarks are very hard to build, we use SysBench ourselves, and we are very grateful for the efforts of Alexey Kopytov. SysBench is in many cases close enough for native situations. The problem is that some effects a real world OLTP database has on a hypervisor (such as network connections and complex buffering that requires a lot more memory management) may not show up if you run a benchmark on such an oversimplified model.

The VMmark benchmark is also starting to show its age with its very low memory requirements per server. To limit the amount of development time, the creators of VMmark also went with some industry standard benchmarks, which have started to lose their relevance as vendors have found ways to artificially inflate the scores. VMmark needs an update, but as VMware is involved in the SPEC Virtualization Committee to develop a new industry standard virtualization benchmark, it does not make sense to develop VMmark further.

The easiest way to see that VMmark is showing its age is in the consolidation ratio of the VMmark runs. Dual CPU machines are consolidating 8 to 17 tiles. That means a dual CPU system is running up to 102 virtual machines (17 tiles × 6 VMs), of which 85 are actively stressed! How many dual CPU machines have you seen that run even half that many virtual machines?
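
The arithmetic behind those numbers is easy to verify; here is a quick sketch using the tile composition and the 5GB figure from the flaw list above:

    VMS_PER_TILE = 6     # mail, java, idle, web, database, and file server
    ACTIVE_PER_TILE = 5  # everything except the idle VM
    GB_PER_TILE = 5      # total RAM one tile needs, per the flaw list above

    for tiles in (8, 17):
        print(f"{tiles:>2} tiles: {tiles * VMS_PER_TILE:>3} VMs "
              f"({tiles * ACTIVE_PER_TILE} actively stressed), "
              f"~{tiles * GB_PER_TILE}GB of RAM")
    # 17 tiles: 102 VMs (85 actively stressed), ~85GB of RAM

Note how the memory flaw and the inflated consolidation ratio reinforce each other: 102 VMs fit in 85GB only because each VM is unrealistically small.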

That said, we'll have to work with VMmark until something better comes along. That brings up two questions. How can you spot reliable and unreliable VMmark scores? Can you base decisions on the scores?



The VMmark Scoring Chaos

When the new Xeon based on the "Nehalem" architecture launched, there was a lot - and I am being polite - of confusion about the VMmark scores. Scores ranged from 14.22 to 23.55 for dual CPU servers based on the same Xeon X5570 2.93GHz. Look at the published and "unpublished" VMmark scores we found from various sources:

[Chart: VMware VMmark scores (ESX 3.5 Update 3 unless noted otherwise)]

The new Xeon 5570 is between 55% and 157% faster, depending on your source. Let's try to make sense out of this. First, take the lowest score, 14.22.

Intel's own benchmarking department was very courageous to publish this score of 14.22. Intel allowed us to talk to Tom Adelmeyer and Doshi A. Kshitij, both experienced engineers. Tom is one of the creators of vConsolidate and Doshi is a principal engineer who wrote some excellent academic papers; his specialty is hardware virtualization. The answer I got was surprising: the 14.22 score came from a simple "out-of-the-box" VMmark run that did not make use of Hardware Assisted Paging (or EPT if you like), running on top of an ESX 3.5 Update 4 beta. Remember, only the full ESX 3.5 Update 4 contains full support for Intel's EPT technology. Update: My mistake: ESX 3.5 Update 4 has full support for Intel's Xeon X55xx CPUs, but not for the EPT technology; that is only available in ESX 4.0 and later. My thanks go to Scott Drummonds (VMware) for pointing this out.

So the 14.22 score is not comparable to the 11.28 score of the AMD Opteron "Shanghai", as the latter is a fully optimized result. Intel's Xeon X5570 can obviously do better, but how much better? A score obtained in February in the same Intel labs, which was never published, was 17.9. It is in the right ballpark for a VMmark run that is clearly better optimized and most likely running with EPT enabled. Update: We believe this was run on an early version of ESX 4.0.

On Intel's own site you can find this PDF, which in the very small print mentions a score of 19.51, obtained somewhere in March on VMware ESX Build 140815 with DDR3-800 (Update: This seems to be a sort of "Release Candidate" of ESX 4.0). As running 13 tiles requires a lot of memory, Intel outfitted the Nehalem server with 18x4GB DDR3 DIMMs. A dual-socket Nehalem system has six memory channels (three per socket), so 18 DIMMs means three DIMMs per channel, and the clock speed of the DDR3 is throttled back to 800MHz.

At the launch date of the Xeon X5570, scores above 23 were reached with VMware builds 148592, 148783, and 150817; since the launch of VMware's vSphere 4.0, these builds have been superseded by the final ESX 4.0. One reason these configurations obtain higher scores is that they run with two DIMMs per channel, so the DDR3 DIMMs run at 1066MHz. That is good for a boost of 5-6%, which has been confirmed by both Intel's and AMD's VMmark experts. The second and most important reason is that ESX 4.0 is used. As VMware states in some whitepapers, CPU scheduling has improved in the new ESX 4.0, especially for the Nehalem architecture with its SMT (Hyper-Threading).
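
The DIMM configuration matters because of how DDR3 is downclocked as channels fill up. Assuming the typical Nehalem rule (one DIMM per channel allows DDR3-1333, two DDR3-1066, three DDR3-800; the exact behavior depends on the platform firmware) and assuming the launch configurations used 12 DIMMs (two on each of the six channels, inferred from the description above), the two setups work out as follows:

    CHANNELS = 2 * 3  # two sockets x three DDR3 channels per Nehalem Xeon

    def ddr3_speed(total_dimms, channels=CHANNELS):
        """Typical Nehalem downclocking rule; platform firmware may differ."""
        per_channel = -(-total_dimms // channels)  # ceiling division
        return {1: 1333, 2: 1066, 3: 800}[per_channel]

    print(ddr3_speed(18))  # 3 DIMMs/channel -> 800MHz (the 19.51 run)
    print(ddr3_speed(12))  # 2 DIMMs/channel -> 1066MHz (the 23+ runs)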

Intel's server benchmarking department did the IT community a favor by releasing the 14.22 "out-of-the-box, unoptimized, and no EPT" score. This gave us a realistic worst-case score. The benchmark showed that the newest Nehalem Xeon could outperform its competitor even in the worst conditions, adding credibility to Intel's claims. In contrast, the benchmark claims on Intel's product page are pretty shady.

A 161% performance boost over the previous generation is called an "exceptional gain", but the claim is completely overshadowed by the flawed comparison. It is simply bad practice to compare a score obtained on - at that time - unreleased brand-new software (VMware ESX build 148592, similar to ESX 4.0) with a benchmark run on older but available software (ESX 3.5 Update 3). While there is little doubt that any server based on the Xeon X55xx is a superior consolidation platform, it is unfortunate that Intel did not inform its customers properly, especially if you consider that a fair comparison, with the Xeon 54xx and Xeon 55xx both on ESX 4.0, would probably also deliver "exceptional gains".



ESX 4.0: Nehalem Enhanced

There are strong indications that the Xeon 54xx will not gain much from running VMmark on ESX 4.0. The "CPU part" of the performance improvements in the new ESX 4.0, the foundation of VMware's vSphere, seems to have focused on the new Nehalem Xeon. That is hardly a surprise, as the previous Xeon platforms have quite a few disadvantages as a high performance virtualization platform: limited bandwidth, no hardware-assisted paging, and power-hungry FB-DIMMs.

The new Nehalem Xeon and Shanghai Opteron are clearly the virtualization platforms of the future with hardware-assisted paging, loads of bandwidth, and all kinds of power saving features. Nevertheless, ESX 4.0 really focused on the newest Xeon 55xx series. Even the newest Opterons gain very little from running on ESX 4.0 instead of ESX 3.5:

[Chart: VMware VMmark scores on AMD Opteron]

The quad Opteron 8389SE at 3.1GHz performs about 9% faster on ESX 4.0 than the Opteron 8384 at 2.7GHz on ESX 3.5 Update 3. In other words, the performance gain seems to come entirely from the 15% higher clock speed rather than from hypervisor improvements. That is pretty bad news for AMD. Comparing the 11.28 VMmark score of the Opteron 2384 obtained on ESX 3.5 Update 3 to those of the Xeon X5570 on ESX 4.0 (>22) is unfair, but it does not seem like that score will improve a lot. Even worse, the newest dual Xeon servers outperform the best quad Opteron platforms when both are running the newest hypervisor from VMware.
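
A quick sanity check makes the point: normalize the observed speedup by the clock increase and see what, if anything, is left over for the hypervisor. Using the figures above:

    clock_ratio = 3.1 / 2.7   # Opteron 8389SE vs. Opteron 8384
    observed_speedup = 1.09   # ~9% higher VMmark score, per the data above

    # If VMmark scaled purely with clock speed, we would expect ~15%;
    # any residual would be attributable to ESX 4.0 itself.
    residual = observed_speedup / clock_ratio - 1
    print(f"clock alone: +{clock_ratio - 1:.1%}, "
          f"residual ESX 4.0 gain: {residual:+.1%}")
    # prints: clock alone: +14.8%, residual ESX 4.0 gain: -5.1%

A negative residual suggests ESX 4.0 contributed nothing on this platform; if anything, the 3.1GHz part scales slightly less than linearly with clock.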

AMD told us they expect the VMmark scores of the Opterons to be 5 to 10% higher on ESX 4.0. Although the quad-socket numbers do not reflect this, it is a plausible claim: ESX 4.0 takes the processor cache architecture into account to optimize CPU usage and has a finer-grained locking mechanism. These improvements should benefit all CPUs.



Back to Reality

While Intel's claims on the Xeon 55xx product page are based on a flawed comparison, the newest VMmark data suggests that the "Nehalem Xeon" is indeed more than twice as fast as the older Xeons and (almost) twice as fast as the newest Opterons when running on ESX 4.0. We expect the dual Opteron "Shanghai" 8389 at 2.9GHz to achieve a score between 12 and 13.5 on ESX 4.0, while the typical score for the Xeon X5570 is around 20-23.5 depending on the clock speed of the DDR3 modules.

That tells the ICT professional that the Xeon X5570 is a CPU with the potential to run extremely high numbers of VMs, but not much more. The real world value of VMmark is highly debatable as it is showing its age:

  • How many of us are running more than 50 to 100 VMs that need on average only 1GB each? It sounds like desktop virtualization, but it is supposed to represent server virtualization.
  • Would there be any real world java application that shows the same performance profile as SPECjbb? No I/O, no shared data between the different threads, and "SPECjbb-only" JVM optimizations?
  • How close is SysBench, which performs all of its transactions on a single monolithic table, to a real OLTP database running on top of a hypervisor?

The problem is that VMmark offers only one data point, one that hardly reflects the real world scenarios IT professionals currently deal with. As a result, VMmark is yet another industry benchmark where the experts of the large OEMs create unrealistically high scores with expensive SAN configurations. It's still interesting but hardly relevant for the real world. It is time for a new data point: the more virtualization scenarios tested, the better. Just give us a few more days….
