We do use 10GbE at work, and I spent a long time finding the right solution:
- CX4 is outdated: huge cables, short reach, and power hungry.
- XFP is also outdated, and fiber only.
- SFP+ is THE thing to get: very low power, and it can be used with copper twinax AS WELL as fiber. You can get a 7m twinax cable for $150.
And the BEST cards available are from Myricom: very powerful for a decent price.
Be aware that VMDq is not SR-IOV. Yes, VMDq and NetQueue are methods for splitting the data stream across different interrupts and CPUs, but they still go through the hypervisor and vSwitch from the one PCI device/function. With SR-IOV, the VM is directly connected to a virtual PCI function hosted on the SR-IOV capable device. The hypervisor is needed to set up the connection, then gets out of the way. This allows the NIC device, with a little help from an IOMMU, to DMA directly into the VM's memory, rather than jumping through hypervisor buffers. Intel supports this in their 82599 follow-on to the 82598 that you tested.
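For readers who want to check whether a NIC (and kernel) actually exposes SR-IOV virtual functions, here is a minimal sketch that reads the standard Linux sysfs attributes; it assumes a kernel recent enough to expose sriov_totalvfs/sriov_numvfs, and the interface name eth0 is only a placeholder.

```python
from pathlib import Path

def sriov_info(iface="eth0"):
    """Report SR-IOV capability of a NIC via Linux sysfs (iface is a placeholder)."""
    dev = Path(f"/sys/class/net/{iface}/device")
    info = {}
    for attr in ("sriov_totalvfs", "sriov_numvfs"):
        f = dev / attr
        # The attribute only exists if the device and kernel support SR-IOV.
        info[attr] = int(f.read_text()) if f.exists() else None
    # Each enabled VF shows up as a virtfn<N> symlink pointing at its own PCI function.
    info["virtual_functions"] = sorted(p.name for p in dev.glob("virtfn*"))
    return info

if __name__ == "__main__":
    print(sriov_info("eth0"))
```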
Regarding the 10Gb performance on native Linux, I have tested Intel 10Gb (the 82598 chipset) on RHEL 5.4 with iperf/netperf. It runs at 9.x Gb/s with a single-port NIC and about 16Gb/s with a dual-port NIC. To reach 9+ Gb/s, iperf/netperf have to run multiple threads (about 2-4) and use a large TCP window size (I used 512KB). I just have a little doubt about the Ixia IxChariot benchmark since I'm not familiar with it.
I was surprised by the poor native Linux results as well. I got > 9 Gbit/s with Broadcom NetXtreme using nuttcp as well. I don't recall whether multiple threads were required to achieve those numbers; I don't think they were, but perhaps using a newer kernel helped, as the Linux networking stack has improved substantially since 2.6.18.
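If you want to reproduce that kind of native-Linux test, a rough sketch of driving a multi-stream test from Python is below. It assumes iperf (v2) is installed, a server is already running with `iperf -s`, and the address 10.0.0.2 is purely hypothetical.

```python
import subprocess

SERVER = "10.0.0.2"   # hypothetical address of the box running `iperf -s`

def run_iperf(streams, window="512K", seconds=30):
    """Run a multi-stream TCP throughput test and return iperf's report."""
    cmd = [
        "iperf",
        "-c", SERVER,        # client mode, connect to the server
        "-P", str(streams),  # number of parallel TCP streams
        "-w", window,        # requested TCP window size per stream
        "-t", str(seconds),  # test duration in seconds
        "-f", "g",           # report in Gbit/s
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

if __name__ == "__main__":
    # Single stream first, then a few in parallel; on 10GbE the difference is often large.
    for n in (1, 2, 4):
        print(f"--- {n} stream(s) ---")
        print(run_iperf(n))
```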
So, basically, what these cards are doing (figuratively speaking) is taking in ("multiplexing") 8 or 16 requests (or however many virtual queues there are) into a single NIC, then sorting ("demultiplexing") them out to the respective VM; the VM then takes care of the request and sends it on its way. Can anyone tell me if I got this right?
Yes, I think you've got it... that's pretty much how it works. At the risk of oversimplifying... these cards are like a multi-port switch with 10Gbe uplinks.
Consider a physical analog (depending on the card, and not exact but close enough): 8/16x 1Gbe ports on the server connected to a switch with 8/16x 1Gbe ports and 1/2x 10Gbe uplinks to the backbone.
Now replace that with a card on the server and 1/2x 10Gbe backbone ports. Port/switch/cable consolidation ratios of 8:1 or 16:1 can save serious $$$ (and with better/dynamic bandwidth allocation).
The typical sticking point is that 10Gbe switches/routers are still quite expensive, and unless you've got a critical mass of 10Gbe, the infrastructure cost can be a tough hump to get over.
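As a toy model of the "sorting" step described above, the sketch below demultiplexes incoming frames into per-VM queues keyed on destination MAC, which is roughly what a VMDq-capable NIC does in hardware. The MAC addresses and frames are invented for illustration.

```python
from collections import defaultdict, deque

# Hypothetical mapping maintained by the hypervisor: VM vNIC MAC -> receive queue id
MAC_TO_QUEUE = {
    "02:00:00:00:00:01": 0,   # VM1
    "02:00:00:00:00:02": 1,   # VM2
    "02:00:00:00:00:03": 2,   # VM3
}
DEFAULT_QUEUE = 0  # unknown or broadcast traffic falls back to a default queue

queues = defaultdict(deque)

def classify(frame):
    """Pick the receive queue for a frame based on its destination MAC."""
    return MAC_TO_QUEUE.get(frame["dst_mac"], DEFAULT_QUEUE)

def receive(frame):
    """'Hardware' places the frame in a per-VM queue; each queue gets its own interrupt/CPU."""
    queues[classify(frame)].append(frame)

if __name__ == "__main__":
    for dst in ("02:00:00:00:00:02", "02:00:00:00:00:01", "02:00:00:00:00:02"):
        receive({"dst_mac": dst, "payload": b"..."})
    for q, frames in sorted(queues.items()):
        print(f"queue {q}: {len(frames)} frame(s)")
```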
I've got to admit that I've skimmed through the article (and the first page and a half of comments)... But it seems from your testing & numbers that you haven't used a dedicated NIC for every card in the 4x 1Gbit example (4 VMs test); otherwise you'd get lower CPU numbers simply because you skip the load scheduling that's done on the CPU.
Any "VM expert" will tell you that you have 3 basic bottlenecks in any VM server:
- RAM (the more the better, mostly not a problem)
- disks (again, more is better, and the absolute minimum is at least one drive per VM)
- NICs
For NICs the basic rule would be: if a VM is loaded with a network-heavy application, then that VM should have a dedicated NIC. CPU utilization drops heavily, and NIC utilization is higher.
Having one 10Gbit NIC shared among 8 VMs which are all bottlenecked by NICs means you have your 35% CPU load. With one NIC dedicated to each VM you'd have CPU load near zero at file-copy loads (NIC has hardware scheduler, disc controller has the same for HDDs).
Like I've said, maybe I've overlooked something in the article, but it seems to me your tests are based on the wrong assumptions. Besides, if you've got 8 file servers as VMs, you've got unnecessary overhead as well; it's one application (file serving), so there's no need to virtualize it into 8 VMs on the same hardware.
In conclusion, VMs are all about planning, so I believe your test took the wrong approach.
> "a dedicated NIC for every VM"
That might be the right approach when you have a few VMs on the server, but it does not seem reasonable when you have tens of VMs running. What do you mean by dedicating? Pass-through? Port grouping? Only pass-through has near-zero CPU load AFAIK, and I don't see many scenarios where pass-through is handy.
Also, if you use dedicated NICs for network-intensive apps, that means you cannot use that bandwidth for the occasional spike in another "non NIC privileged" VM.
It might not be feasible at all if you use DRS or Live migration.
The whole point of VMDq is to offer the necessary bandwidth to the VM that needs it (for example, give one VM 5 Gbit/s, one VM 1 Gbit/s, and the others only 1 Mbit/s) and to keep the layer 2 routing overhead mostly on the NIC. It seems to me that the planning you promote is very inflexible, and I can see several scenarios where dedicated NICs will perform worse than one big pipe which can be load balanced across the different VMs.
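To make the "one big pipe, balanced across VMs" point concrete, here is a toy water-filling calculation: VMs that ask for little keep only what they need, and the leftover 10Gbit/s is split among the busy ones. The VM names and demands are invented, and this says nothing about how ESX or the NIC actually schedules traffic.

```python
def share_link(capacity_gbps, demands):
    """Toy max-min fair split of one big pipe among VMs with different demands."""
    alloc = {vm: 0.0 for vm in demands}
    remaining = dict(demands)
    left = capacity_gbps
    while remaining and left > 1e-9:
        fair = left / len(remaining)
        # VMs that want less than the fair share just take what they need...
        satisfied = {vm: d for vm, d in remaining.items() if d <= fair}
        if satisfied:
            for vm, d in satisfied.items():
                alloc[vm] += d
                left -= d
                del remaining[vm]
        else:
            # ...and the rest split whatever is left equally.
            for vm in remaining:
                alloc[vm] += fair
            left = 0.0
            remaining.clear()
    return alloc

if __name__ == "__main__":
    # Hypothetical instantaneous demands (Gbit/s) from four VMs sharing a 10Gbit/s port.
    print(share_link(10.0, {"db_vm": 5.0, "web_vm": 1.0, "backup_vm": 8.0, "idle_vm": 0.001}))
```

With static per-VM NICs, the idle VM's link would sit unused while the backup VM stays capped; the shared pipe hands that spare capacity to whoever is busy.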
Yes, I meant "dedicated" as "pass-through".
Yes, there are several scenarios where "one big" is better than several small ones, but think about whether 35% CPU load (and that's 35% of a very expensive CPU) is worth sacrificing just to have a reserve for a few occasional spikes.
I do agree that putting several VMs on one NIC is OK, but that's for applications that aren't loaded with heavy network transfers. VM load balancing should be done, for example, like this (just a stupid example, don't hold onto it too hard):
- you have a file server as one VM
- you have a mail server on a second VM
- you have some CPU-heavy app on a separate VM
The file server is heavy on networking and the disc subsystem, but puts almost no load on RAM/CPU. The mail server is dependent on several variables (antispam, antivirus, number of mailboxes & incoming mail, etc.), so it can be a light-to-heavy load for all subsystems. For this example let's say it's the lighter kind of load. Let's say this hardware machine has 2 NICs. You've got a few CPUs with multiple cores, and plenty of disc/RAM. So what's the right thing to do? Add a CPU-intensive VM, so that the CPU isn't idle too much. You dedicate one NIC to the file server, and you let the mail server share a NIC with the CPU-intensive VM. That way the file server has enough bandwidth without taxing the CPU to 35% because of stupid virtual routing of great amounts of network packets, the CPU is left mostly free for the CPU-intensive VM, and the mail server happily lives in between the two, as it will be satisfied with the leftover CPU and networking.
Now scale that to 20-30 VMs, and all you need is 10 NICs. VMs that aren't network dependent you put on "shared NICs", and for network-intensive apps you give those VMs a dedicated NIC.
Just remember: 35% of a multi-socket & multi-core server is a huge expense when you can do the same job on a dedicated NIC. A NIC is, was, and will be much more cost effective at network packet scheduling than a CPU. Why pay several thousand dollars for CPU if all you need is another NIC?
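A crude sketch of the kind of planning argued for above: dedicate NICs to the most network-hungry VMs and let everything else share what is left. The VM list, traffic figures, the "heavy" threshold, and the NIC count are all invented for illustration.

```python
def plan_nics(vms, nic_count, heavy_mbps=500):
    """Greedy toy: dedicate NICs to the most network-hungry VMs, share the rest."""
    plan = {f"nic{i}": [] for i in range(nic_count)}
    by_traffic = sorted(vms.items(), key=lambda kv: kv[1], reverse=True)
    shared = []
    dedicated = 0
    for vm, mbps in by_traffic:
        # Leave at least one NIC for all the light VMs to share.
        if mbps >= heavy_mbps and dedicated < nic_count - 1:
            plan[f"nic{dedicated}"].append(vm)
            dedicated += 1
        else:
            shared.append(vm)
    plan[f"nic{nic_count - 1}"].extend(shared)
    return plan

if __name__ == "__main__":
    # Hypothetical steady-state traffic per VM in Mbit/s.
    vms = {"fileserver": 900, "mailserver": 120, "number_cruncher": 5}
    print(plan_nics(vms, nic_count=2))
    # -> fileserver gets nic0 to itself; mailserver and the CPU-heavy VM share nic1.
```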
I hate my own typos... Second sentence: "a dedicated NIC for every VM", not "for every card". There's probably more nonsense; I'm in a hurry, sorry, people!
All the new 10G kit appears to be coming with SFP+ connectors. They can be used either with a transceiver for optical, or a pre-terminated copper cable (known as 'SFP+ Direct Attach').
CX4 seems to be deprecated as the cables are quite big and cumbersome.
I've seen some mini-clusters (3-10 machines) lately with Ethernet interconnects. Although I doubt that this is the best solution, it would be nice to know how 10G Ethernet actually performs in that area.
I don't find a power use of <10W for a 10Gb link such a bad compromise over 0.5W per 1Gb Ethernet link (assuming that you can use that 10Gb link at close to maximum capacity). If nothing else, you're trading two 4-port 1Gb network cards for one 10Gb card.
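The trade-off is easy to put in numbers. A quick back-of-envelope using the figures quoted above (<10W per 10Gb port, ~0.5W per 1Gb port), which are of course only rough:

```python
# Rough per-port power figures from the comment above (watts); ballpark only.
POWER_10GBE = 10.0    # one 10GbE port
POWER_1GBE = 0.5      # one 1GbE port

def watts_per_gbps(watts, gbps):
    return watts / gbps

if __name__ == "__main__":
    # Two quad-port gigabit cards (8 ports) vs. one 10GbE port.
    eight_gige = 8 * POWER_1GBE
    print(f"8x 1GbE : {eight_gige:.1f} W for 8 Gbit/s "
          f"({watts_per_gbps(eight_gige, 8):.2f} W per Gbit/s)")
    print(f"1x 10GbE: {POWER_10GBE:.1f} W for 10 Gbit/s "
          f"({watts_per_gbps(POWER_10GBE, 10):.2f} W per Gbit/s)")
```

At those (rough) numbers the 10GbE port actually burns more watts per Gbit/s; the win is in ports, slots and cables saved rather than in energy.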
Sun's 40Gb/s adapters are not terribly expensive (they start at $1500); apparently they support 8 virtual lanes? So Mellanox provides Sun their silicon. I went to their site and they do have other silicon/cards that explicitly state they support Virtual Protocol Interconnect. I'm curious if this is the same thing. I know you stated that the need really isn't there, but it would be interesting to see if you can ask for testing samples or look into the viability of InfiniBand. Looking at their partners page, they provide the silicon for Xsigo, as a previous poster stated. Again, it would be nice to see if 40Gb InfiniBand, with and without VPI, is superior to 10Gb Ethernet with the acceleration you showed us today. For SANs, anything to lower latency for iSCSI is desired. Perhaps spending a little for reduced latency on the network layer makes it worth the extra price for faster transactions? So many possibilities! Thank you for all the insightful research you have provided us!
The per-port prices of 10GbE are still $ludicrous; you're not going to be able to connect an entire VMware farm plus storage at a "reasonable" price. I'd suggest looking at InfiniBand:
Pros:
40Gb/s theoretical; about 25Gb/s maximum out of single-stream IP traffic, or 2.5x faster than 10GbE.
Per-switch-port costs about 3x-4x less than 10GbE, and comparable per-adapter port costs.
Latency even lower than 10Gbe.
Able to do remote direct memory access for specialized protocols (Google helps here).
Fully supported under the major operating systems, including ESX4.
Cons:
Hefty learning curve. Expect to delve into mailing lists and obscure documentation, although just the "basic" IP functionality is easy enough to get started with.
10GbE has familiarity going for it, but it is just not cost-effective enough yet, whereas InfiniBand just seems to get cheaper, faster, and, lately, a lot more user friendly. Just something to consider next time :D
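A quick note on why "40Gb/s" InfiniBand never delivers 40Gb/s of payload: QDR signals over 4 lanes at 10Gb/s each but uses 8b/10b encoding, so only 8 of every 10 bits carry data, and the host's PCIe slot imposes its own ceiling. The sketch below only works through that arithmetic; the ~25Gb/s single-stream figure quoted above is a measurement, not something this math predicts.

```python
def encoded_payload_gbps(raw_gbps, data_bits=8, total_bits=10):
    """Usable data rate after line encoding (8b/10b by default)."""
    return raw_gbps * data_bits / total_bits

if __name__ == "__main__":
    qdr_raw = 4 * 10.0                     # QDR IB: 4 lanes at 10 Gbit/s signalling
    print(f"QDR IB usable : {encoded_payload_gbps(qdr_raw):.0f} Gbit/s")              # 32
    ten_gbe_raw = 10.3125                  # 10GBASE-R line rate, 64b/66b encoded
    print(f"10GbE usable  : {encoded_payload_gbps(ten_gbe_raw, 64, 66):.0f} Gbit/s")  # ~10
    # A PCIe 2.0 x8 slot (5 GT/s per lane, 8b/10b) gives ~32 Gbit/s of data per
    # direction before packet overhead, roughly what it takes to keep a QDR HCA busy.
    print(f"PCIe 2.0 x8   : {encoded_payload_gbps(8 * 5.0):.0f} Gbit/s per direction")
```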
Thanks. Good first-order test and summary. A few more details and tests would be great, and I look forward to more on this subject...
1. It would be interesting to see what happens when the number of VMs exceeds the number of VMDq queues provided by the interface. E.g., 20-30 VMs with 16 VMDq queues... does it fall on its face? If yes, that has significant implications for hardware selection and VM/hardware placement.
2. It would be interesting to see if the Supermicro/Intel NIC can actually drive both ports at close to an aggregate 20Gb/s.
3. What were the specific test parameters used (MTU, readers/writers, etc.)? I ask because those throughput numbers seem a bit low for the non-virtual test (they wouldn't have been surprising 2-3 years ago), and very small changes can have very large effects with 10GbE.
4. I assume most of the tests were primarily unidirectional? It would be interesting to see performance under full-duplex load.
> "In general, we would advise going with link aggregation of quad-port gigabit Ethernet ports in native mode (Linux, Windows) for non-virtualized servers."
10x 1Gbe links != 1x 10Gbe link. Before making such decisions, people need to understand how link aggregation works and its limitations.
> "10Gbit is no longer limited to the happy few but is a viable backbone technology."
I'd say it has been for some time, as vendors who staked their lives on FC or Infiniband have discovered over the last couple years much to their chagrin (at least outside of niche markets). Consolidation using 10Gbe has been happening for a while.
"At best since it's a PCIe 1.1 x8 would be about 12Gbps per direction for a total aggregate throughput of about 24Gbps bi-directional traffic."
How are you figuring 12 Gbps max? PCIe 1.x can push 250 MBps per lane (in each direction). A x8 connection should max out around 2,000 MBps, which sounds just about right for a dual 10 GbE card.
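Both figures are defensible depending on what you count. A rough reconciliation: 2.5 GT/s per lane with 8b/10b encoding gives the 250 MB/s-per-lane number, while TLP headers and flow control eat a further slice of that. The ~78% protocol-efficiency figure below is a common rule of thumb, not a measured value for this card.

```python
def pcie_gbps(lanes, gt_per_s, encoding=8 / 10, protocol_eff=1.0):
    """Per-direction PCIe throughput: lanes x signalling rate x line encoding x protocol efficiency."""
    return lanes * gt_per_s * encoding * protocol_eff

if __name__ == "__main__":
    # PCIe 1.1 x8: 2.5 GT/s per lane, 8b/10b line encoding.
    data_layer = pcie_gbps(8, 2.5)                    # the "250 MB/s per lane" ceiling
    realistic = pcie_gbps(8, 2.5, protocol_eff=0.78)  # after TLP/flow-control overhead (rule of thumb)
    print(f"data-layer max : {data_layer:.1f} Gbit/s per direction (~{data_layer / 8 * 1000:.0f} MB/s)")
    print(f"with overhead  : {realistic:.1f} Gbit/s per direction")
```

So the 2,000 MB/s figure is the raw data-layer ceiling and the ~12 Gbit/s figure is roughly what survives packet overhead; either way, a gen-1 x8 slot cannot quite feed two 10GbE ports at line rate.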
In the opening statements it basically boils down to file servers being the biggest bandwidth hogs, so I'd like to see an SMB and enterprise review of how exactly you could saturate these connections, comparing the 4x 1Gb ports to your 10Gb cards in real-world usage. Everyone uses Chariot to show theoretical numbers, but I'd like to see real-world examples.
What kind of RAID arrays, SSDs and CPUs are required on both the server AND CLIENT side of these cards to really utilize that much bandwidth?
Other than a scenario such as 4 or 5 clients all writing large sequential files to a fileserver at the same time, I'm having trouble seeing the need for a 10Gb connection; even at that level you'd be limited by hard disk performance on a 4- or maybe even 8-disk RAID array unless you're using 15k drives in RAID 0.
I guess I'd like to see the other half of this "affordable 10Gb" explained for SMB: how best to use it, when it's usable, and what is required beyond the server's NIC to use it.
Continuing the above example, if the 4 or 5 clients were reading off a server instead of writing, you begin to be limited by the client CPU and HD write speeds; in this scenario, what upgrades are required on the client side to best make use of the 10Gb server?
The biggest benefit for 10Gb is not bandwidth, it's port consolidation, thus reducing total cost.
Then it comes down to how much IO the storage subsystem can provide. If the storage system can only provide 500MB/s, then how can a 10Gb NIC help?
I also don't understand why anyone would want to run a file server as a VM and connect it to a NAS to store the actual data. A NAS is designed for that already; why add another layer?
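On the "500MB/s of storage" point above, a unit conversion makes the bottleneck obvious. A trivial sketch; the figures are just the example number from the comment plus the roughly 1250MB/s it would take to fill a 10Gb pipe.

```python
def mbytes_to_gbits(mb_per_s):
    """Convert storage throughput in MB/s to the link bandwidth it can fill in Gbit/s."""
    return mb_per_s * 8 / 1000

if __name__ == "__main__":
    for mb_s in (120, 500, 1250):
        print(f"{mb_s:>5} MB/s of storage throughput ~= {mbytes_to_gbits(mb_s):.1f} Gbit/s on the wire")
```

So a 500MB/s array fills only about 4Gbit/s; the storage subsystem has to sustain on the order of 1.2GB/s before a single 10Gb port becomes the limit.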
File server access is, as far as I have seen, not that random. In our case it is used to stream (OS + desktop app) images, software installations, etc.
So in most cases you have relatively few users that are downloading hundreds of MB. Why would you not consolidate that file server? It uses very little CPU power (compared to the webservers) most of the time, and it can use the power of your SAN pretty well as it accesses the disks sequentially. Why would you need a separate NAS?
Once your NAS is integrated in your virtualized platform, you can get the benefit of HA, live migration etc.
For most people, their storage for the virtualized platform is NAS-based (NFS/iSCSI). I still count iSCSI as NAS since it's an add-on to NAS. Most NAS devices support multiple protocols: NFS, CIFS, iSCSI, etc.
If you don't have a proper NAS device, that's a different story, but if you do, why waste resources on a virtual host to duplicate the features your NAS already provides?
The only thing I can think of at the moment is that your SAN is overburdened and you want to move portions of it into your VM to give your SAN more resources to do other things. As mentioned, streaming system images can be put on a cheap/simple NAS or VM, where you allow your SAN, with all its features, to do what you paid for it to do. Seems like a quick fix to free up your SAN temporarily; however, it is rare to see any IT shop set things up ideally. There are always various constraints.
Furthermore, where do the upgrades stop? Dual NICs are common on workstations, but you can also get triple and quad built in, or add-in cards. Where do you stop?
Maybe I'm looking for an answer to a question that doesn't have a clear-cut answer; it's just a balancing act, and you have to balance performance with how much you have to spend.
If you upgrade the server to remove it as a bottleneck, then your clients become the bottleneck; if you team up enough client NICs then your server becomes the bottleneck again; if you upgrade the server with a PCIe solid state drive like the Fusion-io and several 10Gb connections, then your clients and your switch start to become the bottleneck, and on and on...
If you use "IT" "upgrades" and "end" in the same post, well... it doesn't end. It ends the day megacorporations can run off a handful of servers, which is never because the requirements keep going up. Like for example your HDD bottleneck, well then let's install a SSD array that can run tens (hundreds?) of thousands of IOPS and several Gbit/s speeds and something else becomes the bottleneck. It's been this way for decades.
You stop when you have enough performance to meet your needs. How much is that? Depends on your needs. Where's the bottleneck? A bit of investigation will identify them.
If you have a server serving a bunch of clients, and the server network performance is unacceptable, then increasing the number of 1Gbe ports on the server is likely your best choice if you have expansion capability; if not then port/slot consolidation using 10Gbe may be appropriate. However, if server performance is limited by other factors (e.g., CPU/disk), then that's where you should focus.
If you have clients hitting a server, and the client network performance is unacceptable (and the server performance is OK), then (in general) aggregating ports on the client won't get you much (if anything). In that case 10Gbe on the client may be appropriate. However, if client performance is limited by other factors (e.g., CPU/disk), then that's where you should focus.
Link aggregation works best when traffic is going to different sources/destinations, and is generally most useful on a server which is serving multiple clients (or between switches with a variety of end-point IP's).
4X 1Gbe links != 1x 4Gbe link. Link aggregation and load balancing across multiple links is typically based on source/destination IP. If they're the same, they'll follow the same link/path, and link aggregation won't buy you much because all of the packets from the same source/destination are following the same path--which means they go over the same link, which means that the speed is limited to that of the single fastest link. (Some implementations can also load balance based on source/destination port as well as IP, which may help in some situations.)
That means that no matter how many 1Gbe links you have aggregated on client or server, the end-to-end speed for any given source/destination end-to-end IP pair (and possibly port) will never exceed that of a single 1Gbe link. (While there are ways to get around that, it's generally more trouble than it's worth and can seriously hurt performance.)
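A toy model of why aggregation behaves this way: the switch or bonding driver hashes the flow identifiers, and every packet of a given flow lands on the same member link. The hash below is a stand-in (an MD5 of the tuple), not the actual algorithm of any particular 802.3ad implementation, and the addresses are hypothetical.

```python
import hashlib

LINKS = 4  # e.g., four aggregated 1Gb member links

def pick_link(src_ip, dst_ip, src_port=0, dst_port=0):
    """Deterministically map a flow to one member link (layer-3 or layer-3+4 style hash)."""
    key = "{}|{}|{}|{}".format(src_ip, dst_ip, src_port, dst_port).encode()
    return int.from_bytes(hashlib.md5(key).digest()[:4], "big") % LINKS

if __name__ == "__main__":
    client, server = "10.0.0.10", "10.0.0.20"   # hypothetical endpoints
    # IP-only hashing: every transfer between this client/server pair uses the same link.
    print("IP hash      :", [pick_link(client, server) for _ in range(4)])
    # IP+port hashing: different source ports *may* spread across member links.
    print("IP+port hash :", [pick_link(client, server, sport, 445) for sport in (50001, 50002, 50003, 50004)])
```

With the IP-only hash, all four "transfers" map to the same link, which is exactly the single-link ceiling described above; only the port-aware variant has a chance of spreading them.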
I thought this is why you had to use Link Aggregation and NIC teaming in combination, giving the client and server one IP each across multiple Ethernet cables. That way, when a client with 2x NICs is doing, say, a sequential transfer from a server with 2x, 3x or 4x, you could get 240MB/s throughput if the storage systems on either end can handle it; but when a 2x client connects to another 1x client, you're limited by the slower of the two connections and thus only capable of 120MB/s max, which would leave the door open to still have another 120MB/s coming from a second client at the same time.
Maybe it's all this SSD talk as of late, but I just want to see some of those options and bottlenecks tackled in real life, and I just don't happen to have 5 or 10 SSDs kicking around to try it myself.
Link Aggregation (LACP) == NIC teaming (cf. 802.3ad/802.1AX). Assigning different IPs will not get you anything unless higher layers are aware and capable (e.g., use multipath, which can improve performance, but in my experience not by a lot, and it comes with overhead).
Reordering Ethernet frames or IP packets can carry a heavy penalty, more than it's worth in many (most?) cases, which is why packets sent from any endpoint pair will (sans higher-order intervention) follow the same path. Endpoint pairs are typically based on IP, although some switches also use the port numbers (i.e., path == hash of IPs in the simple case, path == hash of IPs+ports in the more sophisticated case). Which is why you generally won't see performance exceed the fastest *single physical link* between endpoints (regardless of how many links you've teamed/aggregated), and which is why a single fast link can be better than teamed/aggregated links.
E.g., team 4x 1Gbe links on both client and server. You generally won't see more than 1Gb from the client to the server for any given xfer or protocol. If you run multiple xfers using different protocols (i.e., different ports) and you have a smart switch, you may see > 1Gb.
In short, if you have a client with 4x 1Gb teamed/aggregated NICs, you won't see >1Gb for any IP/port pair, and probably not for any IP pair (depending on the switch/NIC smarts and how you've done your port aggregation/teaming) on the client, switch and server. Which again is why a single faster link is generally better than teaming/aggregation.
There's a simple way for you to test it... fire up an xfer from a client with teamed NICs to a server with plenty of bandwidth. In most cases it will max out at the rate of the fastest single physical link on the client (or server). Fire up another xfer using the same protocol/port. In most cases the aggregate xfer will remain about the same (all packets are following the same path). If you see an increase, congratulations, your teaming is using both IP and port hashing to distribute traffic.
In the market for a new SAN for a server, in preparation for a consolidation/virtualization/headquarters move, one of my RFP requirements was for 10GbE capability now. Some peers of mine have questioned this requirement, stating that there is enough bandwidth with EtherChanneled 4Gb of NICs and that FC would be the better option if that's not enough.
Thank you for doing this write-up, as it confirms that my hypothesis is correct and 10GbE is (and will remain) a valid requirement for the new gear from a very forward-looking view.
It would be nice to see the same kit used in combination with SANs. With the constant churn of new gear, this would be very helpful.
FC is a good solution where iSCSI over Gbit is not enough but 10Gbps, along with all its teething troubles, is just not worth it.
FC is reliable, 4Gb is not overpriced, and it has none of the issues of iSCSI.
It just works.
Granted, for heavily loaded situations, especially on blades, 10G is the way to go.
But for many medium loads, FC is often the simpler/cheaper option.
Which issues of iSCSI exactly? And FC is still $600 per port if I am not mistaken?
Not to mention that you need to bring a whole new kind of knowledge into your organisation, while iSCSI works with the UTP/Ethernet tech that every decent ITer knows.
The issues with latency, reliability, multipathing etc. etc.
Basically the strongest point of iSCSI is the low up-front price and single-infrastructure mantra.
Optimal for small or scale-out. Not so much for mid projects.
"... while iSCSI works with the UTP/Ethernet tech that every decent ITer knows ..."
Sorry to break the news, but any serious IT shop has in-house FC storage experience going back a decade or more.
1Gbps is really not in competition with FC. It is an order of magnitude behind in latency and protocol overhead (read: low IOPS).
Where the fun starts is 10G vs FC, and serious 10G infrastructure is actually more expensive per port than FC.
Where iSCSI over 10G shines is in the port consolidation opportunity.
Not in bandwidth, not in latency, not in price per port.
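The latency point translates directly into single-outstanding-IO performance: at queue depth 1, IOPS is simply 1 divided by the round-trip time. The latencies below are illustrative orders of magnitude only, not measurements of any particular FC or iSCSI setup.

```python
def qd1_iops(round_trip_seconds):
    """With one outstanding I/O, you complete at most 1/RTT operations per second."""
    return 1.0 / round_trip_seconds

if __name__ == "__main__":
    # Illustrative per-I/O transport latencies (seconds); orders of magnitude only.
    examples = {
        "software iSCSI over 1GbE": 500e-6,
        "iSCSI over 10GbE":         150e-6,
        "4Gb FC":                    50e-6,
    }
    for name, rtt in examples.items():
        print(f"{name:<26} ~{qd1_iops(rtt):>7,.0f} IOPS at queue depth 1")
```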
Hi, thanks for the article!
Btw, I am reading your site because of your virtualization articles.
Almost 3 years ago I planned an IT project with only 1/5 of the complete budget for a small virtualization scenario.
If you want redundancy, it can't get much simpler than this:
- 2 ESX servers
- one SAN + one NFS/iSCSI/potentially FC storage for D2D backup
- 2 TCP switches, 2 FC switches
The world moved, IT changed, the EU grant took too long to process... we finished last summer what was planned years ago.
My 2 cents from a small company finishing a small IT virtualization project?
FC saved my ass.
iSCSI was on my list (Dell gear), but I went FC instead (HP) for a lower total price (thanks, crisis :-)
The HP hardware looked sweet on the spec sheets, and the actual HW is superb, BUT... the FW sucked BIG TIME.
It took HP half a year to fix it.
HP 2910al switches do have an option for up to four 10Gbit ports; that was the reason I bought them last summer.
Coupled with DA cables, it's a very cheap way to get 10Gbit to your small VMware cluster (100% viable now).
But unfortunately the FW (at the time) sucked so much that 3 out of 4 supplied DA cables did not work at all out of the box.
Thanks to HP, they swapped our DA cables for 10Gbit SFP+ SR optics! :-)
After installation we had several issues with "dead ESX cluster".
Not even ping worked!
FC worked flawlessly through these nightmares.
Switches again...
A spanning tree protocol bug ate our cluster.
Now we are finally happy. Everything works as advertised.
The 10Gbit primary links are backed up by 1Gbit standby links.
Insane backup speeds for whole VMs to the Nexenta storage appliance, compared to our legacy SMB solution.
Thank you. Very nice suggestion, especially since we have already started to test this out :-). It will have to wait until April though, as we have a lot of server CPU launches this month.
Aren't the new 32nm Intel server platforms coming with standard 10GbE NICs? After my SAN project, I'm going to phase in the new 32nm CPU servers, mainly for AES-NI. The 10GbE NICs would be an added bonus.
Basically, it seems like Xsigo is using InfiniBand to connect each server to an InfiniBand switch, and that InfiniBand connection is then used by software which offers both a virtual HBA and a virtual NIC. Right? Innovative, but starting at $100k it looks expensive to me.
"Typically, we’ll probably see something like 20 to 50 VMs on such machines."
That would be a low VM-per-core count in my environment. I typically have 40 VMs or more running on a 16-core host that is populated with 96 GB of RAM.
Agreed. With Nehalems it's about a 2 VMs per core ratio in our environment. And that's conservative. At least with vSphere and its overcommit capabilities.
It all depends on design and application type; we typically have 5-6 VMs on a 12-core 32GB machine, and about 350 of those machines, running in a constant 60-70% CPU utilization range.
Sorry I'm so late to this thread, but I was curious to know what the vSwitch is doing during the benchmark. How is it configured? @emusln notes that SR-IOV is more than just VMDq, and AFAIK the Intel 82598EB doesn't support SR-IOV, so what we're seeing is the boost from NetQueue. What support for SR-IOV is there in ESX these days? It'd be nice to see SR-IOV data too.