Original Link: https://www.anandtech.com/show/4958/facebooks-open-compute-server-tested
Facebook's "Open Compute" Server tested
by Johan De Gelas on November 3, 2011 12:00 AM EST
Facebook Technology Overview
Facebook had 22 million active users in the middle of 2007; fast forward to 2011 and the site now has 800 million active users, 400 million of whom log in every day. Facebook has grown exponentially, to say the least! Coping with this kind of exceptional growth while offering a reliable and cost-effective service requires out-of-the-box thinking. Typical high-end, brute force, ultra-redundant software and hardware platforms (for example, Oracle RAC databases running on top of a few IBM Power 795 systems) won't do: they are too complicated, too power hungry, and most importantly far too expensive for such extreme scaling.
Facebook first focused on thoroughly optimizing its software architecture, which we will cover briefly. The next step was for Facebook's engineers to design their own servers in order to minimize the power and cost of their server infrastructure. Facebook Engineering then open sourced these designs to the community; you can download the specifications and mechanical CAD designs at the Open Compute site.
The Facebook Open Compute server design is ambitious: “The result is a data center full of vanity free servers which is 38% more efficient and 24% less expensive to build and run than other state-of-the-art data centers.” Even better is that Facebook Engineering sent two of these Open Compute servers to our lab for testing, allowing us to see how these servers compare to other solutions in the market.
As a competing solution we have an HP DL380 G7 in the lab. Recall from our last server clash that the HP DL380 G7 was one of the most power efficient servers of 2010. Is a server "targeted at the cloud" and designed by Facebook engineering able to beat one of the best and most popular general purpose servers? That is the question we'll answer in this article.
Cloud = x86 and open source
From a high-level perspective, the basic architecture of Facebook is not that different from other high performance web services.
However, Facebook is the poster child of the new generation of Cloud applications. It's hugely popular and very interactive, and as such it requires much more scalability and availability than your average website that mostly serves up information.
The "Cloud Application" generation did not turn to the classic high-end redundant platforms with heavy Relational Database Management Systems. A combination of x86 scale-out clusters, open source websoftware, and "no SQL" is the foundation that Facebook, Twitter, Google and others build upon.
Facebook has, however, improved several pieces of the open source software puzzle to make them better suited for extreme scalability. Facebook chose PHP as its presentation layer because it is simple to learn, write, and read; the downside is that PHP is very CPU and memory intensive.
According to Facebook’s own numbers, PHP is about 39 times slower than C++ code, so it was clear that Facebook had to solve this problem first. The traditional approach is to rewrite the most performance-critical parts in C++ as PHP extensions, but Facebook tried a different solution: its engineers developed HipHop, a source code transformer. HipHop transforms the PHP source code into faster C++ code and compiles it with g++.
The next piece in the Facebook puzzle is Memcached. Memcached is an in-RAM object caching system with some very cool features. Memcached is a distributed caching system, which means a memcached cache can span many servers; the "cache" is thus in fact a collection of smaller caches. It basically reclaims unused RAM that your operating system would probably waste on less efficient file system caching. These “cache nodes” do not sync or broadcast, and as a result the memory cache is very scalable.
Facebook quickly became the world's largest user of memcached and improved it vastly: porting it to 64-bit, lowering TCP memory usage, distributing network processing over multiple cores (instead of one), and so on. Facebook mostly uses memcached to alleviate database load.
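To illustrate how memcached alleviates database load, here is a minimal sketch of the classic cache-aside pattern, assuming a pool of memcached nodes and the python-memcached client; the key scheme and the database helper are hypothetical illustrations, not Facebook's actual code.

```python
import memcache  # python-memcached client


def query_database(user_id):
    """Hypothetical stand-in for an expensive MySQL query."""
    return {"id": user_id, "name": "example"}

# Keys are hashed across the node pool, so the "cache" is really
# a collection of independent, non-syncing node caches.
mc = memcache.Client(["10.0.0.1:11211", "10.0.0.2:11211"])

def get_user_profile(user_id):
    key = "user:%d" % user_id      # hypothetical key scheme
    profile = mc.get(key)          # try the RAM cache first
    if profile is None:            # cache miss: hit the database...
        profile = query_database(user_id)
        mc.set(key, profile, time=300)  # ...and cache the result for 5 minutes
    return profile
```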
The Facebook Server
In the basement of the Palo Alto, California headquarters, three Facebook engineers built Facebook's custom-designed servers, power supplies, server racks, and battery backup systems. The Facebook server had to be much cheaper than the average server, as well as more power efficient.
The first change they made was the chassis height, going for a 1.5U design as a compromise between density and ease of cooling. The extra height allows them to use taller heatsinks and larger (60mm), lower-RPM fans instead of the screaming 40mm energy hogs used in a 1U chassis. The result is that the fans consume only 2% to 4% of the total power, which is pretty amazing as we have seen 1U fans consume up to one third of total system power. It seems that air cooling in the Open Compute 1.5U server is as efficient as in the best 3U servers.
At the same time, Facebook Engineering kept the chassis very simple, without any plastic. It makes the airflow through the server smoother and reduces weight. The bottom plate of one server serves as the top plate for the server beneath it.
Facebook has designed an AMD and an Intel motherboard, both manufactured by Quanta. Much attention was paid to the efficiency of the voltage regulators (94% efficiency). The other trick was, again, to remove anything that was not absolutely necessary: these motherboards have no BMC, few USB (2) and NIC (2) ports, one expansion slot, and are headless (no video chip).
The only thing that an administrator can do remotely is "reboot over LAN". The idea is that if that does not help, the problem is in 99% of cases severe enough that you have to send an administrator to the server anyway.
The AMD servers are mostly used as memcached servers: the four memory channels of the AMD Opteron 6100 "Magny-Cours" CPUs can drive 12 DIMMs per CPU, or 24 DIMMs in total. Populated with 16GB DIMMs, that works out to 384GB of caching memory.
In contrast, the Facebook Open Compute Xeon servers have only six DIMM slots, as they are used for processing-intensive tasks such as the PHP servers that "assemble" the data.
Dual PSU
The power supply has two input connectors: one for the 277V AC input and another that accepts 48V DC. The PSU can operate on 48V for about 10 minutes before getting too hot and shutting down, so the power supply is not built to run on 48V DC all the time. The idea is that 48V DC circuits replace a traditional UPS system; after a few minutes the generators should be online and the power supply should be back on the 277V AC input.
The power supply is extremely efficient: up to 94.5%.
Using 277V instead of 208V allowed Facebook to save about 3-4% of energy use, a result of lower power losses in the transmission lines.
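A back-of-the-envelope illustration of why the higher distribution voltage helps, under the simplifying assumption that line losses are purely resistive: for the same delivered power $P$ over a line with resistance $R$, the current is $I = P/V$, so

\[
\frac{P_{\mathrm{loss},277}}{P_{\mathrm{loss},208}} = \frac{(P/277)^2 R}{(P/208)^2 R} = \left(\frac{208}{277}\right)^2 \approx 0.56
\]

In other words, the resistive losses in the distribution path drop by roughly 44%; since those losses are only a few percent of total consumption, that is consistent with the 3-4% overall saving.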
Power Supply Efficiency Visualized
I graduated as an electromechanical engineer, but 17 years of IT jobs and research have made me forget a lot about electricity and electronics. However, I have the advantage of running the Sizing Servers Lab (at the university college of West Flanders, Howest) and thus the privilege of working with some very talented people. Tijl Deneut told me he would be able to visualize the efficiency of the power supply. With the advanced Racktivity PDU, he managed to produce a time graph that shows how closely the current sine wave follows the voltage sine wave. If the two are perfectly in phase, the power quality or power factor is 100%.
In your own home, this power factor is less important. However, large installations such as data centers have to pay extra for a bad power factor, as a low power factor causes the electrical system to draw more current for the same amount of work being done, and more current results in higher heat losses.
Data centers have large power factor correctors, electronic systems with large capacitors that improve the PF but also consume energy. A bad PF can thus increase the Power Usage Effectiveness (PUE) of the data center, and PUE has become an extremely important "benchmark" for data centers. The less this correction hardware has to work the better, so the PF of a server PSU should be as close to 1 as possible.
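To make the relationship between the waveforms and the power factor concrete, here is a small sketch that computes real power, apparent power, and PF from sampled voltage and current; the synthetic 30° phase lag and the numpy-based math are our own illustration, not the Racktivity tooling.

```python
import numpy as np

# Synthetic single-phase measurement: 50 Hz mains, current lagging
# the voltage by 30 degrees (a mediocre power factor).
fs, f = 10_000, 50.0                       # sample rate (Hz), mains frequency (Hz)
t = np.arange(0, 0.2, 1 / fs)              # ten full cycles
v = 230 * np.sqrt(2) * np.sin(2 * np.pi * f * t)
i = 6 * np.sqrt(2) * np.sin(2 * np.pi * f * t - np.pi / 6)

real_power = np.mean(v * i)                # average instantaneous power (W)
apparent_power = np.sqrt(np.mean(v**2)) * np.sqrt(np.mean(i**2))  # Vrms x Irms (VA)
pf = real_power / apparent_power

print(f"P = {real_power:.0f} W, S = {apparent_power:.0f} VA, PF = {pf:.2f}")
# A 30 degree lag yields PF = cos(30°) ≈ 0.87; perfectly in-phase waves give PF = 1.0.
```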
We started by measuring while the server is close to idle, which is a pretty bad scenario for the PF. First let's look at the sine waves of the HP DL380 G7:
That's not bad at all, but next let's look at the sine waves of the AC entering the Open Compute server:
The current sine wave is not only closer to the voltage sine wave, it is also much closer to the ideal form of an AC sine wave, which makes energy delivery more efficient. It is one of the first indications that the Facebook engineers did their homework very well.
Benchmark Configuration
HP Proliant DL380 G7
CPU | Two Intel Xeon X5650 at 2.66 GHz |
RAM | 6 x 4GB Kingston DDR3-1333 FB372D3D4P13C9ED1 |
Motherboard | HP proprietary |
Chipset | Intel 5520 |
BIOS version | P67 |
PSU | 2 x HP PS-2461-1C-LF 460W HE |
We have three servers to test. The first is our own standard off-the-shelf server, an HP DL380 G7. This server is the natural challenger for the Facebook design, as it is one of the most popular and efficient general purpose servers.
As this server is targeted at a very broad public, it cannot be as lean and mean as the Open Compute servers.
Facebook's Open Compute Xeon version
CPU | Two Intel Xeon X5650 at 2.66 GHz |
RAM | 6 x 4GB Kingston DDR3-1333 FB372D3D4P13C9ED1 |
Motherboard | Quanta Xeon Opencompute 1.0 |
Chipset | Intel 5500 Rev 22 |
BIOS version | F02_3A16 |
PSU | Power-One SPAFCBK-01G 450W |
The Open Compute Xeon server is configured as close to our HP DL380 G7 as possible.
Facebook's Open Compute AMD version
CPU | Two AMD Opteron "Magny-Cours" 6128 HE at 2.0 GHz |
RAM | 6 x 4GB Kingston DDR3-1333 FB372D3D4P13C9ED1 |
Motherboard | Quanta AMD Open Compute 1.0 |
Chipset | |
BIOS version | F01_3A07 |
PSU | Power-One SPAFCBK-01G 450W |
The benchmark numbers of the AMD Open Compute server are only included for your information. There is no direct comparison possible with the other two systems. The AMD system is better equipped than the Intel, as it has more DIMM slots and uses HE CPUs.
Common Storage system
Each server has an Adaptec 5085 PCIe x8 controller (driver aacraid v1.1-5.1[2459] b 469512) connecting to six Seagate Cheetah 300GB 15000 RPM SAS disks in a Promise JBOD J300s.
Software configuration
VMware ESXi 5.0.0 (b 469512 - VMkernel SMP build-348481 Jan-12-2011 x86_64). All VMDKs use thick provisioning and are independent and persistent. The power policy is Balanced Power.
Other notes
The servers were fed by a standard European 230V (max. 16A) power line. The room temperature was monitored and kept at 23°C.
Introducing Our Open Virtualization Benchmark
vApus Mark II is our own virtualization benchmark suite that tests how well servers cope with virtualizing "heavy duty" applications. We explained the benchmark methodology here. The beauty of vApus Mark II is that:
- We test with real-world applications used in enterprises all over the world
- We can measure response times
- It scales from 8-thread to 80-thread servers
- It is lightweight on the client side: one humble client is enough to bring the most massive server to its knees. For a virtualized server or cluster, you only need a few clients.
There is one big disadvantage, however: the OLAP and web applications are the intellectual property of several software vendors, so we can't let third parties verify our tests. To deal with this, the Sizing Servers Lab developed a new benchmark called vApus For Open Source workloads, or vApus FOS for short.
vApus FOS uses a methodology similar to vApus Mark II, with "tiles". The exact software configuration may still change a bit, as we tested with the 0.9 version. One vApus FOS 0.9 tile uses four different VMs (summarized in the sketch after this list), consisting of:
- A PhpBB (Apache2, MySQL) website with one virtual CPU and 1GB RAM. The website uses about 8GB of disk space. We simulate up to 50 concurrent users who press keys every 0.6 to 2.4 s.
- The same VM but with two vCPUs.
- An OLAP MySQL database that is used by an online webshop. The VM gets two vCPUs and 1GB RAM. The database is about 1GB, with up to 500 connections active.
- Last but not least: the Zimbra VM. VMware's open source groupware offering is by far the most I/O intensive VM. This VM gets two vCPUs and 2GB RAM, with up to 100 concurrent users active.
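For reference, one tile can be summarized as the following resource allocation; this is a descriptive sketch of the list above (with hypothetical VM names), not actual benchmark configuration code.

```python
# One vApus FOS 0.9 tile: four VMs, seven vCPUs and 5GB of RAM in total.
TILE = [
    {"vm": "phpbb_web_1", "app": "PhpBB (Apache2, MySQL)", "vcpus": 1, "ram_gb": 1, "load": "50 users"},
    {"vm": "phpbb_web_2", "app": "PhpBB (Apache2, MySQL)", "vcpus": 2, "ram_gb": 1, "load": "50 users"},
    {"vm": "olap_db",     "app": "OLAP MySQL webshop DB",  "vcpus": 2, "ram_gb": 1, "load": "500 connections"},
    {"vm": "zimbra",      "app": "Zimbra groupware",       "vcpus": 2, "ram_gb": 2, "load": "100 users"},
]
assert sum(vm["vcpus"] for vm in TILE) == 7   # matches the "7 vCPUs x 4 tiles" used later
```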
All VMs are based on a minimal CentOS 5.6 install with VMware Tools. vApus FOS can also run on different hypervisors: we already tried KVM, but encountered a lot of KVM-specific problems.
vApus FOS results
In the first test, we use the vApus FOS test that pushes the servers to 90-100% CPU load. This performance test is not really important, as these kinds of server/application combinations are not supposed to run at such high CPU loads. Also remember that this is not an AMD versus Intel graph: the AMD based Open Compute server is meant to be a memcached server, which typically hogs RAM but does not stress the CPU, while the Intel Open Compute server is built to be a CPU intensive web application server. Thus, you should not compare them directly.
The real comparison is between the HP DL-380G7 and the Facebook Open Compute Xeon Server, which both use the same platform: the same CPUs, the same amount of RAM, and so on. The big question we want to answer is whether Facebook's server that is built specifically for low power use and cloud applications can offer a better performance/watt ratio than one of the best and most popular "general purpose" servers.
When requiring the highest performance levels, the HP DL380 G7 is about 11% faster than the Open Compute alternative. We suspect that the Open Compute server is configured to prefer certain lower power, lower performance ACPI settings. However, as this server is not meant to be an HPC server, this matters little. A web server or even virtualized server should not be run at 95-100% CPU load anyway. Let us take a look at the corresponding power consumption.
To deliver 11% higher performance, the HP server has to consume about 22% more power. The Open Compute servers deliver a higher performance/watt even at high performance levels. The advantage is small, but again these servers are not meant to operate at 95+ % CPU load.
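Put differently, using the numbers above:

\[
\frac{(\text{performance}/\text{watt})_{\mathrm{HP}}}{(\text{performance}/\text{watt})_{\mathrm{OCP}}} \approx \frac{1.11}{1.22} \approx 0.91
\]

So at full load, the HP delivers roughly 9% less performance per watt than the Open Compute Xeon server.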
We also checked the power consumption at idle.
The results are amazing: the Open Compute servers need only 74% of the power of the HP, saving a solid 42W when running idle. Also remember that the HP DL380 G7 is already one of the best servers on the market from a power consumption point of view.
Let us see what happens if we go for a real-world scenario.
Measuring Real-world Power Consumption, Part 1
vApus FOS EWL
The Equal Workload (EWL) test is very similar to our previous vApus Mark II "Real-world Power" test. To create a real-world “equal workload” scenario, we throttle the number of users in each VM to a point where you typically get somewhere between 20% and 80% CPU load on a modern dual CPU server. This is the CPU load of vApus FOS:
Note that we do not measure performance in the "start up" phase and at the end of test.
Compare this with vApus FOS EWL:
In this case we measure until the very end. The amount of work to be done is always equal and the length of the test is always the same, so the faster the system, the sooner the workload is done and the more time is spent at idle. For this test we therefore do not measure power but energy (power x time) consumed.
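As an illustration of the difference, energy can be derived from logged power samples by integrating over the fixed test window; the one-second sample interval and placeholder data below are assumptions for the sketch, not the lab's actual tooling.

```python
import numpy as np

# Hypothetical power log: one sample per second over a one-hour EWL run.
timestamps = np.arange(0, 3600, 1.0)                        # seconds
power_w = np.random.uniform(150.0, 250.0, timestamps.size)  # watts (placeholder data)

# Integrate power over time (trapezoidal rule) to get energy in joules,
# then convert to watt-hours: the metric that matters for an EWL run.
energy_j = np.trapz(power_w, timestamps)
energy_wh = energy_j / 3600.0
print(f"Energy consumed: {energy_wh:.1f} Wh")
```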
The measured performance cannot be compared as in "system x is z% faster than system y", but it does give you an idea of how well the server handles the load and how quickly it will save energy by entering a low power state.
The Xeons are all in the same ballpark. The AMD system with its slower CPUs needs more time to deal with this workload. One interesting thing to note is that Hyper-Threading does not boost throughput. That is not very surprising, considering that the total CPU load is between 20 and 80%. What about response time?
Note that we do not simply take a geometric mean of the raw response times. Each response time is first compared to a reference value, and those percentages (response time / reference response time) are then geometrically averaged.
The reference values were measured on the HP DL380 G7 running CentOS 5.6 natively. We run four tiles of seven vCPUs on top of each server, so a value of 117 means that the VMs are on average 17% slower than on the native machine. The 17% higher response times result from the fact that when a VM demands two virtual Xeon CPUs, the hypervisor cannot always oblige: 24 logical CPUs are available while 28 (7 vCPUs x 4 tiles) are requested. In contrast, the software running on the native machine gets two real cores.
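A minimal sketch of that averaging, with hypothetical response times in milliseconds:

```python
import numpy as np

# Hypothetical per-application response times (ms) and their native references.
measured  = np.array([120.0, 95.0, 240.0, 60.0])
reference = np.array([100.0, 80.0, 210.0, 55.0])

# Normalize each response time to its reference first, then take the
# geometric mean of those ratios (never of the raw response times).
ratios = measured / reference
score = np.exp(np.log(ratios).mean()) * 100   # a score of 117 = 17% slower than native
print(f"Response time score: {score:.0f}")
```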
Back to our results. The response time of the AMD based server tells us that even under medium load, a faster CPU can help to reduce the response time, which is the most important performance parameter anyway. However, Hyper-Threading does not help under these circumstances.
Also note that the Open Compute server handles this kind of load slightly better than the HP. So while the Open Compute servers offer a slightly lower top performance, they are at their best in the most realistic benchmark scenarios: between 20% and 80% CPU load. Of course, performance per watt remains the most important metric:
When the CPU load is between 20 and 80%, which is realistic, the HP uses 15% more power. We can reduce the energy consumed by another 10% if we disable Hyper-Threading, which as noted does not improve performance in this scenario anyway.
Measuring Real-World Power Consumption, Part 2
For our second real-world power test, we turn to our "proprietary" virtualization benchmark, vApus Mark II EWL. We'll see whether this VMware + Windows 2008 combination produces results similar to vApus FOS EWL. You can find more details about vApus Mark II EWL here. This workload uses several IIS websites and an MS SQL Server 2008 database. First we'll check performance.
Interestingly, Hyper-Threading does make a difference here: we get about 10% higher performance. When we zoom in on our results, we see that the MS SQL VMs in particular perform better with Hyper-Threading, with an 18% performance boost. The Open Compute server again performs slightly better than the HP.
The HP server needs 12% more power to deliver the same performance. The Open Compute server once again delivers superior performance/watt.
Conclusion
The HP DL380 G7 continues to earn our respect as a very efficient server. It is also a much easier server to manage, thanks to its integrated graphics chip and remote management (BMC). Still, it is clear that these features are not that important for web applications that have to scale out over a large number of servers.
Each rack at Facebook contains 30 Open Compute servers
The Facebook Open Compute servers have made quite an impression on us. Remember, this is Facebook's first attempt at building a cloud server! This server uses very little power when running at low load (see our idle numbers) and offers slightly better performance while consuming less energy than one of the best general purpose servers on the market. The power supply's power factor is also top notch, resulting in even more savings in the data center (e.g. less power factor correction needed).
While it's possible to see the Open Compute servers as a "cloud only" solution, we imagine anyone running quite a few load-balanced web servers will be interested in this hardware. So far only cloud / hyperscale data center oriented players like Rackspace have picked up the Open Compute idea, but many others could benefit from buying these kinds of "keep it simple" servers in smaller quantities.
Looking back over the past few years, a significant part of the innovation in IT has been the result of people building upon or being inspired by open source software (think Android, Amazon's EC2, iOS, Hyper-V...). We look forward to seeing the new data center and hardware technologies that the Open Compute Project will inspire.