Original Link: https://www.anandtech.com/show/2956/10gbit-ethernet-killing-another-bottleneck-



In the second quarter of this year, we’ll have affordable servers with up to 48 cores (AMD’s Magny-Cours) and 64 threads (Intel’s Nehalem EX). The most obvious way to wield all that power is to consolidate large numbers of virtual machines onto these powerhouses. We’ll probably see something like 20 to 50 VMs on such machines. Port aggregation with a quad-port gigabit Ethernet card is not going to suffice: with 40 VMs sharing a quad-port gigabit NIC, that is less than 100Mbit/s per VM. We are back in the early Fast Ethernet days. Until virtualization took over, our network intensive applications would get a gigabit pipe; now we will be offering them 10 times less? This is not acceptable.
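For a rough sense of the arithmetic, this back-of-the-envelope sketch (illustrative only, using the VM counts mentioned above) shows the average share each VM gets of an aggregated set of ports:

# Average bandwidth per VM when several VMs share an aggregated set of NIC ports.
def per_vm_bandwidth_mbit(ports, port_speed_mbit, vms):
    """Average share of the aggregate link capacity per VM, in Mbit/s."""
    return ports * port_speed_mbit / vms

print(per_vm_bandwidth_mbit(ports=4, port_speed_mbit=1000, vms=40))   # 100.0 Mbit/s on 4x 1Gbit
print(per_vm_bandwidth_mbit(ports=1, port_speed_mbit=10000, vms=40))  # 250.0 Mbit/s on 1x 10Gbit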

Granted, few applications actually need a full 1Gbit/s pipe. Database servers need considerably less, only a few megabits per second. Even at full load, the servers in our database tests rarely go beyond 10Mbit/s. Web servers are typically satisfied with a few tens of Mbit/s, but AnandTech's own web server is frequently bottlenecked by its 100Mbit connection. Fileservers can completely saturate Gbit links. Our own fileserver in the Sizing Servers Lab is routinely transmitting 120MB/s (a saturated 1Gbit/s link). The faster the fileserver is, the shorter the waiting time to deploy images and install additional software. So if we want to consolidate these kinds of workloads on the newest “über machines”, we need something better than one or two gigabit connections for 40 applications.

Optical 10Gbit Ethernet – 10GBase-SR/LR – saw the light of day in 2002. Like optical Fibre Channel in the storage world, it was a very expensive technology. Somewhat more affordable, 10G over “InfiniBand-ish” copper cable (10GBase-CX4) was born in 2004. In 2006, 10GBase-T held the promise that 10G Ethernet would become available over copper UTP cables. That promise has still not materialized in 2010; CX4 is by far the most popular copper-based form of 10G Ethernet. The reason is that 10GBase-T PHYs need too much power: the early 10GBase-T solutions needed up to 15W per port! Compare this to the 0.5W that a typical gigabit port needs, and you'll understand why you find so few 10GBase-T ports in servers. Broadcom reported a breakthrough just a few weeks ago, claiming that its newest 40nm PHYs use less than 4W per port. Still, it will take a while before 10GBase-T conquers the world, as this kind of state-of-the-art technology needs some time to mature.

We decided to check out some of the more mature CX4-based solutions, as they are decently priced and require less power. For example, a dual-port CX4 card goes as low as 6W… that is 6W for the controller, two ports, and the rest of the card. So a complete dual-port NIC needs considerably less than one of the early 10GBase-T ports. But back to our virtualized server: can 10Gbit Ethernet offer something that the current popular quad-port gigabit NICs can’t?

Adapting the network layers for virtualization

When lots of VMs are hitting the same NIC, quite a few performance problems can arise. First, one network intensive VM may completely fill up the transmit queues and block access to the controller for some time, which increases the network latency the other VMs see. The hypervisor also has to emulate a network switch that sorts and routes the packets of the various active VMs. Such an emulated switch costs quite a bit of processor time, and this emulation and the other network calculations might all be running on one core. In that case, the performance of that one core may limit your network bandwidth and raise network latency. That is not all: moving data around without being able to use DMA means the CPU has to handle all memory move/copy actions too. In a nutshell, a NIC with one transmit/receive queue plus a software emulated switch is not an ideal combination if you want to run lots of network intensive VMs: it reduces the effective bandwidth, raises network latency, and increases the CPU load significantly.
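To make the head-of-line blocking problem concrete, here is a minimal sketch in Python (our own simplification, not hypervisor code): one chatty VM dumps a burst into the single shared transmit queue just before another VM's packet arrives, and the second VM pays the price in latency.

# Simplified model: all VMs share one transmit queue that drains at line rate.
# A burst from one VM delays every packet queued behind it.
LINE_RATE_BPS = 1_000_000_000     # 1Gbit/s link
FRAME_BITS = 1500 * 8             # full-size Ethernet frame

def queue_delay_ms(packets_ahead):
    """Time a newly queued packet waits for the backlog in front of it to drain."""
    return packets_ahead * FRAME_BITS / LINE_RATE_BPS * 1000

# VM A has just queued a 500-frame burst; VM B's single packet sits behind it.
print(f"VM B waits an extra {queue_delay_ms(500):.1f} ms")   # 6.0 ms of added latency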


Without VMDQ, the hypervisor has to emulate a software switch. (Source: Intel VMDQ Technology)
 

Several companies have solved this I/O bottleneck by making use of multiple queues. Intel calls it VMDq; Neterion calls it IOV. A single NIC controller is equipped with multiple queues. Each receive queue can be assigned to the virtual NIC of a VM and mapped into that VM's guest memory. Interrupts are load balanced across several cores, avoiding the problem of one CPU being completely overwhelmed by the interrupts of tens of VMs.


With VMDq, the NIC becomes a Layer 2 switch with many different Rx/Tx queues. (Source: Intel VMDQ Technology)
 

When packets arrive at the controller, the NIC’s Layer 2 classifier/sorter sorts them and places them (based on the virtual MAC addresses) in the queue assigned to a certain VM. Layer 2 routing is thus done in hardware rather than in software. The hypervisor then looks in the right queue and routes those packets to the right VM. Packets that have to leave your physical server are placed in each VM's transmit queue; in the ideal situation, each VM has its own queue. Packets are sent to the physical wire in a round-robin fashion.
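The mechanism is easy to picture in code. The sketch below is our own simplification in Python, not vendor firmware or driver code: frames are sorted into per-VM receive queues based on their destination MAC address, and the per-VM transmit queues are drained round-robin so no single VM can monopolize the wire. The MAC addresses are placeholders.

from collections import deque

# One Rx and one Tx queue per virtual MAC (hypothetical addresses).
vm_macs = ["00:50:56:00:00:01", "00:50:56:00:00:02", "00:50:56:00:00:03"]
rx_queues = {mac: deque() for mac in vm_macs}
tx_queues = {mac: deque() for mac in vm_macs}
default_queue = deque()   # unknown/broadcast destinations fall back here

def classify_rx(frame):
    """Layer 2 sort: put the frame in the queue of the VM that owns the MAC."""
    rx_queues.get(frame["dst_mac"], default_queue).append(frame)

def transmit_round_robin():
    """Pull at most one frame per VM per pass, so no VM hogs the link."""
    sent = []
    for mac in vm_macs:
        if tx_queues[mac]:
            sent.append(tx_queues[mac].popleft())
    return sent

# Two incoming frames land in the Rx queues of their respective VMs.
classify_rx({"dst_mac": "00:50:56:00:00:01", "payload": b"to VM 1"})
classify_rx({"dst_mac": "00:50:56:00:00:03", "payload": b"to VM 3"})
print(len(rx_queues["00:50:56:00:00:01"]), len(rx_queues["00:50:56:00:00:03"]))   # 1 1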

The hypervisor has to support this, and your NIC vendor must of course provide a driver that exposes these hardware queues to the hypervisor. VMware ESX 3.5 and 4.0 support VMDq and similar technologies under the name “NetQueue”. Microsoft Windows Server 2008 R2 supports this too, under the name “VMQ”.



Benchmark Configuration

We used a point-to-point configuration to eliminate the need for a switch: one machine serves as the other “end of the network”, and on the other machine we measure throughput and CPU load. We used Ixia IxChariot to test network performance.
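We cannot reproduce IxChariot here, but the basic idea of a point-to-point throughput test is straightforward. The sketch below is a minimal Python illustration under our own assumptions, not IxChariot's methodology; the address and port are placeholders, and the other machine simply needs a receiver that drains the socket.

import socket, time

def measure_throughput(host, port, seconds=10.0):
    """Push data over a TCP connection for `seconds` and return the achieved Mbit/s."""
    payload = b"\x00" * 65536                     # 64KB send buffer
    sent_bytes = 0
    with socket.create_connection((host, port)) as sock:
        deadline = time.time() + seconds
        while time.time() < deadline:
            sock.sendall(payload)
            sent_bytes += len(payload)
    return sent_bytes * 8 / seconds / 1e6

if __name__ == "__main__":
    print(f"{measure_throughput('192.168.0.2', 5001):.0f} Mbit/s")   # placeholder endpoint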

Server One ("the other end of the network"):
Supermicro SC846TQ-R900B chassis
Dual Intel Xeon 5160 “Woodcrest” at 3GHz
Supermicro X7DBN Rev1.00 Motherboard
Intel 5000P (Blackford) Chipset
4x4GB Kingston DDR2-667 ValueRAM CAS 5
BIOS version 03/20/08

Server Two (for measurements):
Supermicro A+ 2021M-UR+B chassis
Dual AMD Opteron 8389 “Shanghai” at 2.9GHz
Supermicro H8DMU+ Motherboard
NVIDIA MCP55 Pro Chipset
8x2GB Kingston DDR2-667 ValueRAM CAS 5
BIOS version 080014 (12/23/2009)

NICs

Both servers were equipped with the following NICs:

  • Two dual-port Intel PRO/1000 PT Server adapters (82571EB) (four ports in total)
  • One Supermicro AOC-STG-I2 dual-port 10Gbit/s Intel 82598EB
  • One Neterion Xframe-E 10Gbit/s

We tested the NICs using CentOS 5.4 x64 (kernel 2.6.18) and VMware ESX 4 Update 1.

Important note: the NICs used are not the latest and greatest. For example, Neterion already has a more powerful 10Gbit NIC out, the Xframe 3100. We tested with what we had available in our labs.

Drivers CentOS 5.4
Neterion Xframe-E: 2.0.25.1
Supermicro AOC-STG-I2 dual-port: 2.0.8-k3, 2.6.18-164.el5

Drivers ESX 4 Update 1 b208167
Neterion Xframe-E: vmware-esx-drivers-net-s2io-400.2.2.15.19752-1.0.4.00000
Supermicro AOC-STG-I2 dual-port: vmware-esx-drivers-net-ixgbe-400.2.0.38.2.3-1.0.4.164009
Intel PRO/1000 PT Server adapter: vmware-esx-drivers-net-e1000e-400.0.4.1.7-2vmw.1.9.208167



The Hardware

As always, we worked with the hardware we have available in the labs.

The Neterion Xframe-E is neither Neterion’s latest nor its greatest. It came out in 2008 and was one of the first PCI Express NICs to support VMware’s NetQueue feature in ESX 3.5, so we felt this pioneer in network virtualization should be included. The latest Neterion NICs are the X3100 series, but those cards were not available to us. The Xframe-E has eight transmit and eight receive queues, TCP checksum offload, and TCP segment send/receive offload. The card uses a PCIe 1.1 x8 connector, and typical power consumption is around 12W. Our Neterion Xframe-E used optical SR multi-mode fiber, but a CX4 version is also available.

The Supermicro AOC-STG-I2 also first became available in 2008, but we got the version that was sold in 2009. It is based on the Intel 82598EB chip and supports checksum offloading and even iSCSI booting. Sixteen virtual queues are available, and power consumption should not be higher than 6.5W.

Native performance

We first tested on the Linux distribution CentOS 5.4, based on the 2.6.18 kernel. The first goal was to understand whether 10Gbit cards make sense for non-virtualized platforms.

NIC native Linux performance

The Neterion card needed 6% of our eight 2.9GHz Opteron cores; the Intel chip on our Supermicro card needed only 2.5%. Despite these low percentages, neither card reaches even half of its potential bandwidth. These rather expensive cards do not look very attractive next to a simple quad-port 1Gbit/s Ethernet card.



Network Performance in ESX 4.0 Update 1

We used up to eight VMs, each assigned an “endpoint” in the Ixia IxChariot network test. This way we could measure the total network throughput achievable with one, four, or eight VMs. Making use of ESX NetQueue, the cards should be able to leverage their separate queues and the hardware Layer 2 “switch”.

First, we test with NetQueue disabled, so the cards behave like a card with only one Rx/Tx queue. To make the comparison more interesting, we added two dual-port gigabit NICs to the benchmark mix; teamed NICs are currently by far the most cost-effective way to increase network bandwidth.

NIC performance on ESX 4.0, no NetQueue

The 10G cards show their potential. With four VMs, they are able to achieve 5 to 6Gbit/s. There is clearly a queue bottleneck: both 10G cards perform worse with eight VMs. Notice also that 4x1Gbit does very well. This combination has more queues and can cope well with the different network streams. Out of a maximum line speed of 4Gbit/s, we achieve almost 3.8Gbit/s with four and eight VMs. Now let's look at CPU load.

CPU load on ESX 4.0, no NetQueue

Once you need more than 1Gbit/s, you should pay attention to CPU load. Four gigabit ports or one 10G port require 25-35% utilization of eight 2.9GHz Opteron cores. That means you would need two to three cores dedicated just to keeping the network pipes filled. Let us see if NetQueue can work some magic.

NIC performance on ESX 4.0, NetQueue enabled

The performance of the Neterion card improves a bit, but not impressively so (+8% in the best case). The Intel 82598EB chip on the Supermicro 10G NIC now achieves 9.5Gbit/s with eight VMs, very close to the theoretical maximum. The 4x1Gbit/s NIC numbers are repeated in this graph for reference (no NetQueue is available for them).

So how much CPU power did these huge network streams absorb?

CPU load on ESX 4.0, NetQueue enabled

The Neterion driver does not seem to be optimized for ESX 4: using NetQueue should lower the CPU load, not increase it. The Supermicro/Intel 10G combination shows the way. It delivers twice as much bandwidth at half the CPU load compared to the two dual-port gigabit NICs.



Delving Deeper

Let us take a closer look at how the Neterion and Intel 10G chips are configured on VMware’s vSphere/ESX platform. First, we checked what Neterion's S2IO driver did when ESX was booting.

If you look closely, you can see that eight Rx queues are recognized, but only one Tx queue. Compare this to the Intel ixgbe driver:

Eight Tx and Rx queues are recognized, one for each VM. This is also confirmed when we start up the VMs. Each VM gets its own Rx and Tx queue. The Xframe-E has eight transmit and eight receive paths, but it seems that for some reason the driver is not able to use the full potential of the card on ESX 4.0.

Conclusion

The goal of this short test was to explore the possibilities of 10 Gigabit Ethernet in a virtualized server. If you have suggestions for more real-world testing, let us know.

CX4 is still the only affordable option with reasonable power consumption. Our one-year-old dual-port CX4 card consumes only 6.5W; a similar 10GBase-T solution would probably need twice as much. The latest 10GBase-T advancements (4W instead of >10W per port) are very promising, as we might see power efficient 10G cards over CAT 6 UTP cabling this year.

The Neterion Xframe-E could not fulfill the promise of near-10Gbit speeds at low CPU utilization, but our test can only give a limited indication. This is rather odd, as the card we tested was announced as one of the first to support NetQueue in ESX 3.5. We can only guess that driver support for ESX 4.0 is not optimal (yet). The Xframe X3100 is Neterion’s most advanced product, and its spec sheet emphasizes VMware NetQueue support. Neterion ships mostly to OEMs, so it is hard to get an idea of the pricing. When you spec your HP, Dell, or IBM server for ESX 4.0 virtualization purposes, it is probably a good idea to check that the 10G Ethernet card is not an older Neterion card.

At a price of about $450-$550, the Supermicro AOC-STG-I2 dual-port with the Intel 82598EB chip is a very attractive solution. Typically, a quad-port gigabit Ethernet solution will cost you half as much, but it delivers only half the bandwidth at twice the CPU load in a virtualized environment.

In general, we would advise link aggregation of quad-port gigabit Ethernet for non-virtualized servers running natively (Linux, Windows). For heavily loaded virtualized servers, 10Gbit CX4-based cards are quite attractive. CX4 uplinks cost about $400-$500, and switches with 24 gigabit RJ-45 ports plus two CX4 uplinks are in the $1500-$3000 range. 10Gbit Ethernet is no longer limited to the happy few; it is a viable backbone technology.
 
This article would not have been possible without the help of my colleague Tijl Deneut.
