Original Link: https://www.anandtech.com/show/9138/open-compute-hardware-tried-and-tested
The Next Generation Open Compute Hardware: Tried and Tested
by Johan De Gelas & Wannes De Smet on April 28, 2015 12:00 PM EST
Four years ago we reviewed Facebook's self-designed and open source server, Freedom. Coinciding with releasing their datacenter design, they founded the Open Compute Foundation, creating a home for the design documents, licenses and bringing vendors together under one roof.
Today, the Open Compute Foundation is doing well. Many high profile companies such as Yandex, IBM and Intel are members, and in 2014 Microsoft joined the initiative as a second major open hardware partner, releasing its Open Cloud Server designs into a pool of open hardware that spans everything from servers and switches to datacenters. The international community has grown substantially, with Summits on all major continents attracting a larger crowd each time. Moreover, the number of companies adopting hardware from OCP and contributing back is rising as well, and with good reason: Mark Zuckerberg indicated at the 2015 Summit in San Diego that Facebook achieved a $2 billion (USD) cost reduction (!), in part by using open purpose-built hardware instead of regular proprietary gear.
To Follow and Like, at Scale
When Facebook talks about the incredible amounts of money it has saved by using Open Compute, keep in mind that even though the vanity-free hardware is designed to be cost-effective, it's the actual software that enables hardware efficiency. The company has always sought to use commonly available software components to build its services, and once it picks one, plenty of time and money is spent on performance engineering to optimize the software. That performance engineering enables Facebook to handle 6 billion likes, 930 million photos and 12 billion messages every day, and those numbers only represent the activity its social network product generates; Instagram and WhatsApp are not exactly toy workloads either.
Facebook has carried out plenty of improvements in open-source projects, or started new implementations, most of them contributed back to the community: good karma indeed. A notable performance-related project is HHVM, short for HipHop VM, a JIT-compiling PHP virtual machine with its own PHP language dialect called Hack, which adds static typing to the mix. HHVM is advertised as 'a more predictable PHP', and if Wikipedia's migration serves as any indication, it makes existing PHP-powered sites quite a bit faster as well. Another initiative, RocksDB, implements a scalable, persistent key-value store for fast storage, written in C++. Then there is Presto, a distributed SQL engine for Big Data stored in common storage systems like Cassandra, JDBC databases and HDFS. Presto provides functionality similar to Hive, and is currently in use at Dropbox and Airbnb.
Scale Your Devops
A (dev)ops engineer at Facebook can request any kind of hardware configuration he likes, as long as it is one of the following five:
Increasing the number of parts used in your systems exponentially increases the amount of money and time that will be spent on procurement, validation and maintenance. And keep in mind that in order to avoid vendor lock-in, each configuration must be available from at least two suppliers, so it's easy to imagine why FB prefers to keep it simple when it comes to servers: 5 SKUs using the same base platform, each targeted at a certain kind of application.
The SKUs mentioned are variations on Facebook's latest Xeon E5 server platform, Leopard, reviewed in detail in this article. The web tier is concerned with gathering every piece of information from the entire stack and rendering it to HTML/JSON, for which it needs a decent CPU, but not much else. Object storage, such as photos, is quite the opposite, and requires just a simple CPU (Atom C2000) to serve objects from a large storage backend. At the other end of the spectrum, you have the data crunching units that require a decent chunk of processing power, memory capacity and I/O.
Server Generations
Freedom laid the groundwork for new generations to come. In 2012, Facebook released the specs for its second server design, dubbed "Windmill", which brought updated mainboards supporting Intel's Sandy Bridge-EP platform and AMD's Opteron 6200/6300 CPUs.
The following table details all Facebook-designed OCP servers, starting with Freedom, with differences highlighted.
Facebook-designed OCP Server Generations | ||||||
Freedom (Intel) | Freedom (AMD) | Windmill (Intel) | Watermark (AMD) | Winterfell | Leopard | |
Platform | Westmere-EP | Interlagos | Sandy Bridge-EP | Interlagos | Sandy Bridge-EP / Ivy-Bridge EP | Haswell-EP |
Chipset | 5500 | SR5650/SP5100 | C602 | SR5650/SR5670/SR5690 | C602 | C226 |
Models | X5500/X5600 | Opteron 6200/6300 | E5-2600 | Opteron 6200/6300 | E5-2600 v1 / v2 | E5-2600 v3 |
Sockets | 2 | 2 | 2 | 2 | 2 | 2 |
Max TDP allowed (in Watt) | 95 | 85 | 115 | 85 | 115 | 145 |
RAM per socket | 3x DDR3 | 12x DDR3 | 8x DDR3 | 8x DDR3 | 8x DDR3 | 8x DDR4 /NVDIMM |
~ Node Width (inch) | 21 | 21 | 8 | 21 | 6.5 | 6.5 |
Form factor (height in rack U) | 1.5 | 1.5 | 1.5 | 1.5 | 2 | 2 |
Fans per node | 4 | 4 | 2 | 4 | 2 | 2 |
Fan width (in mm) | 60 | 60 | 60 | 60 | 80 | 80 |
Disk bays (3.5'') | 6 | 6 | 6 | 6 | 1 | 1 |
Disk interface | SATA II | SATA II | SATA III | SATA III | SATA III / RAID HBA | SATA III / M.2 |
Amount of DIMM slots per socket | 9 | 12 | 9 | 12 | 8 | 8 |
DDRX gen | 3 | 3 | 3 | 3 | 3 | 4 |
Ethernet connectivity | 1 GbE fixed | 2 GbE fixed | 2 GbE fixed + PCIe mezz | 2 GbE fixed | 1GbE fixed + 8x PCIe Mezz | 8x PCIe Mezz |
Deployed in | Prineville, Oregon | Prineville | Lulea, Sweden | Lulea | Altoona | ? |
PSU model | PowerOne SPAFCBK-01G | PowerOne SPAFCBK-01G | PowerOne | PowerOne | N/A | N/A |
PSU count | 1 | 1 | 1 | 1 | N/A | N/A |
PSU capacity (in Watt) | 450 | 450 | 450 | 450 | N/A | N/A |
Amount of nodes per sled | 1 | 1 | 2 | 2 | 3 | 3 |
BMC | No (Intel RMM) | No | No (Intel RMM) | No | No (Intel RMM) | Yes (Aspeed AST1250 with 1GB Samsung DDR3 DIMM K4B1G1646G-BCH9) |
Facebook indicated that the expected lifespan of its Freedom nodes is around three years, and that it is in the process of swapping out deprecated equipment for OpenRack v1 based equipment.
Integrate: OpenRack
After Windmill and Watermark, the time was right for another round of consolidation, bringing the rack design into the mix.
An issue with the Freedom servers was that the PSU was non-redundant (resulting in the Dragonstone design) and regularly had a larger power capacity than ever would be needed, which popped up on Facebook's efficiency radar. Adding another PSU in every server would mean an increased CAPEX and OPEX, because even when using an active/passive mode, the passive PSU is still using power.
A logical conclusion then was that this problem could not be solved within the server chassis, but could instead be solved by grouping power supplies of multiple servers. This resulted in OpenRack v1, a rack built with grouped power supplies on 'power shelves' supplying 12.5V DC for a 4.2kW 'power zone'. Each power zone has one power shelf (3 OpenUnits high), a highly available group of power supplies with a 5+1 redundancy feeding 10 OU (OpenU, 1 OU = 48mm) rack units of equipment. When power demand is low, a number of PSUs are automatically powered off allowing the remaining PSUs to operate much closer to their optimal points in the efficiency curve.
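To get a feel for the numbers in that paragraph, here is a minimal back-of-the-envelope sketch. It only uses the figures quoted above (4.2kW per zone, 12.5V DC, 1 OU = 48mm, a 3 OU shelf feeding 10 OU of equipment); the function and constant names are our own and are not part of any OpenRack specification.

```python
# Figures quoted in the text for one OpenRack v1 power zone.
ZONE_POWER_W = 4200.0   # 4.2 kW per power zone
BUS_VOLTAGE_V = 12.5    # DC voltage supplied by the power shelf
OU_MM = 48              # 1 OpenU (OU) = 48 mm
EQUIPMENT_OU = 10       # equipment space fed by one power shelf
SHELF_OU = 3            # height of the power shelf itself


def bus_current_amps(power_w: float, voltage_v: float) -> float:
    """Current drawn through the bus bar at a given load (I = P / V)."""
    return power_w / voltage_v


def ou_to_mm(units: int) -> int:
    """Convert a height in OpenU to millimetres."""
    return units * OU_MM


if __name__ == "__main__":
    amps = bus_current_amps(ZONE_POWER_W, BUS_VOLTAGE_V)
    print(f"Full-load bus bar current: {amps:.0f} A")
    print(f"Equipment bay height: {ou_to_mm(EQUIPMENT_OU)} mm")
    print(f"Zone height incl. shelf: {ou_to_mm(EQUIPMENT_OU + SHELF_OU)} mm")
```

At full load a zone works out to several hundred amps at 12.5V, which goes some way towards explaining why power is carried by solid rails rather than by cabling.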
Another key improvement over regular racks was the power distribution system, which got rid of power cables and their need to be (dis)connected each time a server is serviced. Power is provided by three vertical power rails called a bus bar, with a rail segment for each power zone. After fitting each server with a bus bar connector, you can now simply slide in the server, the connector hot-plugs into the rail at the back, done. An additional 2 OU equipment bay was placed at the top for switching equipment.
Mass Hot Storage: Knox
For OpenRackv1 to work, a new server design was needed that implemented bus bar connectors at the back. Using an updated Freedom-based chassis minus the PSU would cause a fair bit of empty space. Simply filling the entire space with 3.5" HDDs is wasteful, as most of Facebook's workloads aren't so storage hungry. The solution proved to be very similar to the power shelf, namely grouping the additional node storage outside the server chassis on a purpose built shelf: Knox was born.
OCP Knox with one disk sled out (Image Courtesy The Register)
Put simply, Knox is a regular JBOD disk enclosure built for OpenRack which needs to be attached to host bus adapters of surrounding Winterfell compute nodes. It differs from standard 19" enclosures for two main reasons: it can fit 30 3.5" hard disks, and it makes the job of maintenance quite easy. To replace a disk, one must simply slide out the disk sled, pop open the disk bay, swap the disk, close the bay and slide the tray back into the rack. Done.
Object Storage
Seagate has contributed the specification of a "Storage device with Ethernet interface", also known by its productized version as Seagate Kinetic. These hard disks are meant to cut out the middle man and provide an object storage stack directly on the disk. In OCP speak, this would mean the Knox node would not need to be connected to a compute instance but could be connected directly to the network. Seagate, together with Rausch Netzwerktechnik, has released the 'BigFoot Storage Object Open', a new chassis designed for these hard disks, with 12x 10GbE connectivity in a 2 OU form factor.
The concept of the BigFoot system is not unknown to Facebook either, as they have released a system with a similar goal, called Honey Badger. Honey Badger is a modified Knox enclosure and pairs with a compute card -- Panther+ -- to provide (cold) object storage services for pictures and such. Panther+ is fitted with an Intel Avoton SoC (C2350 for low end up to C2750 for high end configurations), up to four enabled DDR3 SODIMM slots, and mSATA/M.2 SATA3 onboard storage interfaces. This plugs onto the Honey Badger mainboard, which in turn contains the SAS controller, SAS expander, AST1250 BMC, two miniSAS connectors and a receptacle for a 10GbE OCP mezzanine networking card. Facebook has validated two configurations for the Honey Badger SAS chipset: one based on the LSI SAS3008 chip and LSI SAS3x24R expander, the other consisting of the PMC PM8074 controller joined by the PMC PM8043 expander.
Doing this eliminates the need for a 'head node', usually a Winterfell system (Leopard will not be used by Facebook to serve up Knox storage), replaced by the more efficient Avoton design on the Panther card. Another good example of modularity and lock-in free hardware design, another dollar saved.
Cold Storage
A slightly modified version of Knox is used for cold storage, with specific attention paid to running the fans slowly and only spinning up a disk when required.
Facebook meanwhile has built another cold storage solution, this time using an OpenRack filled with 24 magazines of 36 cartridge-like containers, each of which holds 12 Blu-ray discs. Apply some maths and you get a maximum capacity of 10,368 discs, and knowing you can fit up to 128GB on a single BD-XL disc, you have a very dense data store of up to 1.26PB. Compared to hard disks optical media touts greater reliability, with Blu-ray discs having a life expectancy of 50 years and some discs could even be able to live on for a century.
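The "apply some maths" step can be spelled out as follows. This is just a quick sketch of the arithmetic using the counts given above; the assumption that the petabyte figure is expressed in binary prefixes (1 PB = 1024 x 1024 GB) is ours, made because it is the reading that lands closest to the number quoted.

```python
# Counts taken from the text; names are illustrative only.
MAGAZINES = 24
CARTRIDGES_PER_MAGAZINE = 36
DISCS_PER_CARTRIDGE = 12
GB_PER_DISC = 128  # BD-XL capacity quoted in the text


def total_discs() -> int:
    """Number of Blu-ray discs in one fully populated rack."""
    return MAGAZINES * CARTRIDGES_PER_MAGAZINE * DISCS_PER_CARTRIDGE


def capacity_gb() -> int:
    """Raw rack capacity in gigabytes."""
    return total_discs() * GB_PER_DISC


def capacity_pb(binary: bool = True) -> float:
    """Rack capacity in petabytes, with binary (1024^2) or decimal (10^6) scaling."""
    divisor = 1024 ** 2 if binary else 10 ** 6
    return capacity_gb() / divisor


if __name__ == "__main__":
    print(f"Discs per rack: {total_discs():,}")
    print(f"Capacity (binary prefixes): {capacity_pb(True):.4f} PB")
    print(f"Capacity (decimal prefixes): {capacity_pb(False):.4f} PB")
```

With binary prefixes the result is roughly 1.27PB, in line with the figure above; with decimal prefixes it comes out at about 1.33PB.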
The rack resembles a jukebox: whenever data is requested from a certain disc, a robot arm takes the cartridge to the top, where another system slides the right discs into one of the Blu-ray readers. This system serves a simple purpose: getting as much data as possible stored in a single rack, with access latency not being hugely important.
The Next Generation: Winterfell
With the PSU and hard disks removed from the server, the only items left in a Freedom chassis were the motherboard, fans and one boot drive. When Facebook's engineers put these things together in a smaller form factor, they created Winterfell, a compute node that's similar to a Supermicro twin node, except in ORv1 three nodes can be placed on a shelf. One Winterfell node is 2 OU high, and consists of a modified Windmill motherboard, a bus bar connector, and a midplane connecting the motherboard to the power cabling and fans. The motherboard is equipped with a slot for a PCIe riser card – on which a full size x16 PCIe card and a half size x8 card can be placed – and a x8 PCIe mezzanine connector for network interfaces. Further connectivity options include both a regular SATA and mSATA connector for a boot drive.
Winterfell nodes, fitted here with optional GPU
But Facebook's ever advancing quest for more efficiency found another target. After an ORv1 deployment in Altoona, it became apparent that three power zones with three bus bars each could deliver far more power than needed, so the engineers took that information and went on to design the successor, the aptly named OpenRack v2. OpenRack v2 uses only two power zones instead of three, and FB's implementation has only one bus bar segment per power zone, bringing further cost reductions (though the PDU built into the rack is still able to power three). The placement of the power shelves relative to the bus bars was also reconsidered: this time they were put in the middle of a power zone, because of the voltage drop in the bus bar when conducting power from the bottom to the top server. ORv2 also allows for some more ToR machinery by increasing the top bay height to 3 OU.
Open Rack v2, with powerzones on the left. ORv2 filled with Cubby chassis on the right. The chassis marked by the orange rectangle fit 12 nodes. (Image Courtesy Facebook)
The change in power distribution resulted in incompatibility with Winterfell, as the bus bars for the edge nodes are now missing, and so Project Cubby saw the light of day. Cubby is very similar to a Supermicro TwinServer chassis, but instead of having PSUs built in, it plugs into the bus bar and wires three internal power receptacles to power each node. The Winterfell design also needed to be shortened depth-wise, so the midplane was removed.
Another cut in upfront costs was realized by using three (2+1 redundancy) 3.3 kW power supplies instead of six per power zone. With the number of power zones decreased, each of the two power zones can deliver up to 6.3 kW in ORv2. This freed up some space on the power shelf, so the engineers decided to do away with the separate battery cabinet that was placed next to the racks. The bottom half of the power shelf now contains three Battery Backup Units (BBU). Each BBU comes with a matrix of Li-ion 18650 cells, similar to those found in a Tesla's battery pack, capable of providing 3.6 kW to bridge the power gap until the generators kick in. Each BBU is paired to a PSU; when the PSU drops out, the BBU takes over until power is restored.
BBU Schematic (Image Courtesy Facebook)
To summarize, by broadening focus to your bog-standard EIA 19" rack and developing a better integration with the servers and battery cabinets, Facebook was able to reduce costs and add capacity for an additional rack per row.
The Latest and Greatest: Leopard
Leopard, the latest update to the Windmill motherboard, is equipped with the Intel C226 chipset to support up to two E5-2600v3 Haswell Xeons.
One processor mode is fully supported, in which the CPU can access all RAM onboard. Increased thermal margins (mainly because of upping the chassis height to 2 OU in Winterfell), bigger CPU heatsinks, and better airflow guidance allow the system to receive CPUs with a maximum TDP of 145 Watt, which means you can insert every Xeon except for the E5-2687W v3 (160W TDP). Only eight DIMM slots are connected per CPU, but DDR4 allows for a maximum capacity of 128GB per DIMM, resulting in a theoretical maximum of 2TB RAM, which Facebook reckons is plenty for years to come. New in this generation is that you can now plug in NVDIMM modules (persistent flash storage in a DIMM form factor), which Facebook is testing to see if they can replace PCIe-based add-in cards.
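The memory and TDP limits in that paragraph boil down to two small checks. The sketch below uses only the numbers given in the text and the generations table (two sockets, eight DIMM slots per socket, 128GB per DIMM, 145W TDP ceiling); the helper names are made up for illustration.

```python
# Platform limits as quoted in the text and the generations table.
SOCKETS = 2
DIMM_SLOTS_PER_SOCKET = 8
MAX_GB_PER_DIMM = 128
MAX_TDP_W = 145


def max_ram_gb() -> int:
    """Theoretical maximum memory of a fully populated Leopard node."""
    return SOCKETS * DIMM_SLOTS_PER_SOCKET * MAX_GB_PER_DIMM


def fits_leopard(cpu_tdp_w: int) -> bool:
    """True if a CPU's TDP is within the platform's thermal allowance."""
    return cpu_tdp_w <= MAX_TDP_W


if __name__ == "__main__":
    print(f"Max RAM: {max_ram_gb()} GB ({max_ram_gb() // 1024} TB)")
    print(f"145 W part fits: {fits_leopard(145)}")
    print(f"E5-2687W v3 (160 W) fits: {fits_leopard(160)}")
```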
Besides the generational CPU update, other major changes include the removal of the onboard external PCIe connector, support for a mezzanine card with dual QSFP receptacles, a TPM header, the addition of an mSATA/M.2 slot for SATA/NVMe based storage, and 8 more PCIe lanes routed to the riser card slot for a total of 24. The SAS connector has been removed, as Leopard will not be used as a head node for Knox.
Leopards, with the optional debug board (power/reset buttons and serial-to-USB) plugged in
A big addition to the board is a baseboard management controller (BMC). A simple headless Aspeed AST1250 controller provides traditional out-of-band IPMI access to query sensor and FRU data, control system power and provide Serial over LAN. But Facebook taught it some new tricks: to aid bare-metal debugging, it keeps 256 POST codes in a buffer, offers 128kB of serial console output, and lets you remotely dump MSR data, which is done automatically after the IERR/MCERR signal becomes active.
A rather unique feature of the BMC is that it allows you to update the CPLD, VR, BMC and UEFI firmware (basically all the firmware present on the motherboard) remotely, a feature also fully validated by all suppliers of the mentioned components. Another feature that's been added is average power reporting: the BMC keeps a buffer of 600 power measurements, and permits you to query the buffer for a specific interval via IPMI. To improve the accuracy of the power sensor data, factory determined (non)-linear compensations are applied to the measured power usage. Lastly, another unique feature that stems from better rack-level integration is the ability to throttle CPU power usage when power demand in the power zone exceeds capacity (for instance when a PSU dies). When the load increases to the PSU capacity, the PSU executes a quick temporary drop to 1 Volt. This triggers an 'Under Voltage' condition in the servers, which in turn activates the Fast Proc Hot signal on the CPUs, causing them to clock down for a certain amount of time and thus decreasing PSU load, allowing the PSU to remain active instead of shutting down.
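The average power reporting feature is essentially a fixed-size history buffer that can be averaged over a recent window. The sketch below illustrates that idea in Python; it is not Facebook's BMC firmware nor an IPMI command. The class and method names are hypothetical, and only the buffer size of 600 comes from the text (the sampling rate is not stated there, so the window is expressed in samples rather than seconds).

```python
from collections import deque


class PowerHistory:
    """Fixed-size buffer holding the most recent power readings, in watts."""

    def __init__(self, capacity: int = 600):
        # deque(maxlen=...) silently drops the oldest sample once full.
        self._samples = deque(maxlen=capacity)

    def record(self, watts: float) -> None:
        """Store one new power measurement."""
        self._samples.append(watts)

    def __len__(self) -> int:
        return len(self._samples)

    def average(self, last_n: int) -> float:
        """Average of the last `last_n` samples (or fewer if not yet available)."""
        if last_n <= 0:
            raise ValueError("last_n must be positive")
        if not self._samples:
            raise ValueError("no samples recorded yet")
        window = list(self._samples)[-last_n:]
        return sum(window) / len(window)


if __name__ == "__main__":
    history = PowerHistory()
    for reading in (210.0, 215.0, 230.0, 225.0):  # made-up readings
        history.record(reading)
    print(f"Average over last 2 samples: {history.average(2):.1f} W")
```

A bounded deque keeps memory use constant, which matters on a small management controller, and the caller picks the interval at query time rather than the buffer committing to one.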
Benchmark Configuration: Leopard Under Stress
To evaluate how the Leopard platform performs, we put it against two Haswell-powered alternatives: Intel's Decathlete reference platform, and a Dell R730. Please note that we did not have a power supply similar to the Dell R730 in-house, and PSUs with comparable efficiency failed to keep the 12V rail high enough, so the PSU of a Freedom node was used. Normally Leopard is powered by a power shelf with better efficiency, so power usage should be lower when deployed in a complete OR setup as compared to ours.
Leopard coupled to Freedom PSU
To test the servers with a workload similar to what is being run at Facebook, we selected the following benchmarks:
- OLAP: a database workload built by a news aggregator website, putting stress on most of the system (CPU, RAM and disk)
- ElasticSearch: search queries on an index of Wikipedia, a CPU and memory intensive workload
System | Intel CPU | RAM | Disk | PSU |
FB Leopard | E5-2670 v3 @ 2.30GHz | Samsung 256 GB DDR4 2133MHz LRDIMM M386A4G40DM0-CPB | Intel SSD DC S3500 240GB | PowerOne SPAFCBK-01G |
Dell R730 | E5-2670 v3 @ 2.30GHz | Samsung 256 GB DDR4 2133MHz LRDIMM M386A4G40DM0-CPB | Intel SSD DC S3500 240GB | Dell D495E-S1 495W 80Plus Platinum |
Intel Decathlete v2 | E5-2670 v3 @ 2.30GHz | Samsung 256 GB DDR4 2133MHz LRDIMM M386A4G40DM0-CPB | Intel SSD DC S3500 240GB | Intel S1100ADU00 1100W 80Plus Platinum |
All tests were executed over a 10GbE connection.
Benchmark Results
Though the systems are equipped with similar components, some notable observations can be made. Each benchmark was executed three times to verify the accuracy of our results. All ratios are calculated using the Dell R730 as baseline (=100%).
OLAP | 95pct Response Time (Lower is better) | | Power (Lower is better) | | Throughput (Higher is better) | |
Concurrency | Facebook Leopard | Intel Decathlete v2 | Facebook Leopard | Intel Decathlete v2 | Facebook Leopard | Intel Decathlete v2 |
5 | 114% | 111% | 99% | 99% | 94% | 117% |
10 | 108% | 105% | 99% | 98% | 96% | 109% |
25 | 96% | 94% | 100% | 101% | 100% | 116% |
50 | 98% | 56% | 103% | 110% | 101% | 112% |
100 | 96% | 49% | 110% | 149% | 103% | 119% |
250 | 98% | 67% | 105% | 144% | 103% | 120% |
250 | 95% | 67% | 106% | 144% | 103% | 120% |
250 | 95% | 67% | 105% | 143% | 102% | 120% |
250 | 90% | 62% | 106% | 143% | 103% | 121% |
An interesting result: the Dell outperforms both Open Compute servers when it comes to energy efficiency. It is good to remember that the PSU used for the Facebook server is not optimal (it should get its power from a power shelf). Leopard is very close, and gets a slightly better performance mark, while Decathlete is a respectable 20% faster, but uses 40% more power in the process.
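Since all figures are ratios against the Dell R730 (=100%), a relative performance-per-watt number follows by dividing the throughput ratio by the power ratio. The sketch below does this for a few rows copied from the OLAP table; the metric definition and the names are our own simplification, not a measurement we published separately.

```python
# (throughput %, power %) relative to the Dell R730, copied from the OLAP table.
OLAP = {
    5:   {"Leopard": (94, 99),   "Decathlete v2": (117, 99)},
    25:  {"Leopard": (100, 100), "Decathlete v2": (116, 101)},
    100: {"Leopard": (103, 110), "Decathlete v2": (119, 149)},
}


def perf_per_watt(throughput_pct: float, power_pct: float) -> float:
    """Relative performance per watt; the Dell R730 baseline scores 1.0."""
    return throughput_pct / power_pct


if __name__ == "__main__":
    for concurrency, systems in OLAP.items():
        for name, (throughput, power) in systems.items():
            ratio = perf_per_watt(throughput, power)
            print(f"concurrency {concurrency:>3}  {name:<14} {ratio:.3f}")
```

At a concurrency of 100, for example, both systems land below 1.0, which is the efficiency gap relative to the Dell described above; at low concurrency the picture is less clear-cut.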
Next, we tested with our real-world Elasticsearch benchmark. Due to ES's internal queuing algorithm, the throughput and response time can vary wildly when it drops a 'heavy' request to allow it to pass many smaller ones, and we've seen the inverse happen as well. These tests were executed five times to ensure the consistency of most results.
ElasticSearch | 95pct Response Time (Lower is better) | | Power (Lower is better) | | Throughput (Higher is better) | |
Concurrency | Leopard | Decathlete v2 | Leopard | Decathlete v2 | Leopard | Decathlete v2 |
5 | 115% | 73% | 107% | 106% | 96% | 104% |
10 | 88% | 88% | 75% | 94% | 101% | 101% |
25 | 74% | 26% | 90% | 98% | 118% | 165% |
50 | 96% | 41% | 95% | 104% | 100% | 153% |
100 | 97% | 74% | 98% | 108% | 107% | 124% |
200 | 103% | 85% | 97% | 112% | 108% | 129% |
200 | 96% | 75% | 100% | 110% | 108% | 130% |
200 | 89% | 66% | 95% | 112% | 110% | 137% |
200 | 96% | 80% | 95% | 108% | 113% | 128% |
250 | 92% | 70% | 97% | 113% | 101% | 121% |
Whereas the Dell and Leopard performed comparably in the OLAP test, the ElasticSearch test results are in favor of the Leopard. The throughput and response times are slightly better and Leopard uses slightly less power, resulting in a tangibly better performance/watt even with a suboptimal power supply. The Decathlete system gets the best scores on the board again, but this time the power usage increase is relatively modest: 8 to 13 percent.
Visiting Facebook's Hardware Labs
We visited Facebook's hardware labs in September, an experience resembling entering the chocolate factory from Charlie and the Chocolate Factory, though the machinery was far less enjoyable to chew on. More importantly though, we were already familiar with the 'chocolate': having read the specifications and followed OCP related news, we could point out and name most of the systems present in the labs.
Wannes, Johan, and Matt Corddry, director of hardware engineering, in the Facebook Hardware labs
This symbolizes one of the ultimate goals for the Open Compute project: complete standardization of the datacenter out of commodity components that can be sourced from multiple vendors. And when the standards do not fit your exotic workload, you have a solid foundation to start from. This approach has some pleasant side effects: when working in an OCP powered datacenter, you could switch jobs to another OCP DC and just carry on doing sysadmin tasks -- you know the system, you have your tools. When migrating from a Dell to HP environment for example, the switch will be a larger hurdle due to differentiation by marketing.
Microsoft's Open Cloud Server v2 spec actually goes the extra mile by supplying you with an API specification and implementation in the Chassis controller, giving devops a REST API to manage the hardware.
Intel Decathlete v2, AMD Open 3.0, and Microsoft OCS
Facebook is not the only vendor to contribute open server hardware to the Open Compute Project either; Intel and AMD joined pretty soon after the OCP was founded, and last year Microsoft joined the party as well in a big way. The Intel Decathlete is currently in its second incarnation with Haswell support. Intel uses its Decathlete motherboards, which are compatible with ORv1, to build its reference 1U/2U 19" server implementations. These systems are seen in critical environments, like High Frequency Trading systems, where the customers want a server built by the same people who built the CPU and chipset, on the premise that it all ought to work well together.
AMD has its Open 3.0 platform, which we detailed in 2013. This server platform is AMD's way of getting its foot in the door of OCP hyperscale datacenters, certainly when considering price. AMD seems to be taking a bit of a break improving its regular Opteron x86 CPUs, and we wonder if we might see the company bring its AMD Opteron-A ARM64 platform (dubbed 'Seattle') into the fold.
Microsoft brought us its Open Cloud Server (v2), a high-density blade-like solution for standard 19" racks that basically powers all of Microsoft's cloud services (e.g. Azure).
A 12U chassis, equipped with 6 large 140x140mm fans, 6 power supplies, and a chassis manager module, carries 24 nodes. Similar to Facebook's servers, there are two node types: one for compute, one for storage. A major difference however is that the chassis provides network connectivity at the back using a 40GbE QSFP+ port and a 10GbE SFP+ port for each node. The compute nodes mate with the connectors inside the chassis, so the actual network cabling can remain fixed. The same principle is applied to the storage nodes, where the actual SAS connectors are found on the chassis, eliminating the need for cabling runs to connect the compute and JBOD nodes.
A V2 compute node comes with up to two Intel Haswell CPUs, with a 120 Watt maximum thermal allowance, paired to the C610 chipset and with 16 DDR4 DIMM slots to share, for a total memory capacity of 512GB. Storage can be provided through one of the 10 SATA ports or via NVMe flash storage. The enclosure provides space for four 3.5" hard disks, four 2.5" SSDs (though space is shared between two of the bottom SSD slots), and an NVMe card. A mezzanine header allows you to plug in a network controller or a SAS controller card. Management of the node can be done through the AST1050 BMC providing standard IPMI functionality; in addition, a serial console of each node is available at the chassis manager as well.
The storage node is a JBOD in which ten 3.5" SATA III hard disks can be placed, all connected to a SAS expander board. The expander board then connects to the SAS connectors on the tray backplane, where they can be linked to a compute node.
Networking
Server contributions aren't the only things happening under the Open Compute project. Over the last couple of years a new focus on networking was added. Accton, Alpha Networks, Broadcom, Mellanox and Intel have each released a draft specification of a bare-metal switch to the OCP networking group. The premise of standardized bare-metal switches is simple: you can source standard switch models from multiple vendors, and run the OS of your choosing on it, along with your own management tools like Puppet. No lock-in and almost no migration path to be concerned with when implementing different equipment.
To that end, Facebook created Wedge, a 40G QSFP+ ToR switch together with the Linux-based FBOSS switch operating system to spur development in the switching industry, and, as always, to offer a better value for the price. FBOSS (along with Wedge) was recently open sourced, and in the process accomplished something far bigger: convincing Broadcom to release OpenNSL, an open SDK for their Trident II switching ASIC. Wedge's main purpose is to decrease vendor dependency (e.g. choose between an Intel or ARM CPU, choice of switching silicon) and allow consistency across part vendors. FBOSS lets the switch be managed with Facebook's standard fleet management tools. And it's not Facebook alone who can play with Wedge anymore, as Accton announced it will bring a Wedge based switch to market.
Facebook Wedge in all its glory
Logical structure of the Wedge software stack
But in Facebook's leaf-spine network design, you need some heavier core switches as well, connecting all the individual ToR switches to build the datacenter fabric. Traditionally those high-capacity switches are sold by the big network gear vendors like Cisco and Juniper, and at no small cost. You might then be able to guess what happens next: a few days ago Facebook launched '6-pack', its modular, high-capacity switch.
Facebook 6-pack, with 2 groups of line/fabric cards
A '6-pack' switch consists of two module types: line cards and fabric cards. A line card is not so different from a Wedge ToR switch, where 16 40GbE QSFP+ ports at the front are supplied with 640Gbps of the 1.2Tbps ASIC's switching capacity; the main difference with Wedge is that the remaining 640Gbps is linked to a new backside Ethernet-based interconnect, all in a smaller form factor. The line card also has a Panther micro server with BMC for ASIC management. In the chassis, there are two rows of two line cards in one group, each operating independently of the other.
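The bandwidth split on a line card is easy to verify. The short sketch below just multiplies out the port count and speed from the paragraph above and assumes, as the text implies, that the backside interconnect gets the same share as the front panel; the function names are illustrative.

```python
# Line card figures quoted in the text.
FRONT_PORTS = 16
PORT_SPEED_GBPS = 40


def front_panel_gbps(ports: int = FRONT_PORTS, speed: int = PORT_SPEED_GBPS) -> int:
    """Aggregate bandwidth of the front-facing QSFP+ ports."""
    return ports * speed


def asic_total_gbps() -> int:
    """Front-panel share plus an equal share for the backside interconnect."""
    return 2 * front_panel_gbps()


if __name__ == "__main__":
    print(f"Front panel: {front_panel_gbps()} Gbps")
    print(f"ASIC total:  {asic_total_gbps()} Gbps")
```

The total of 1280Gbps is what the text rounds to "1.2Tbps".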
Line card (note the debug header pins left to the QSFP+ ports)
The fabric card is the bit connecting all of the line cards together, and thus the center part of the fabric. Though the fabric switch appears to be one module, it actually contains two switches (two 1.2Tbps packet crunchers, each paired to a Panther micro server), and like the line cards, they operate separately from each other. The only thing being shared is the management networking path, used by the Panthers and their BMCs, along with the management ports for each of the line cards.
Fabric card, with management ports and debug headers for the Panther cards
With these systems, Facebook has come a long way towards making its entire datacenter networking built with open, commodity components and running it using open software. The networking vendors are likely to notice these developments, and not only because of their pretty blue color.
ONIE
An effort to increase modularity even more is ONIE, short for the Open Network Install Environment. ONIE is focused on eliminating operating system lock-in by providing an environment for installing common operating systems like CentOS and Ubuntu on your switching equipment. ONIE is baked into the switch firmware, and after installation the onboard bootloader (GRUB) directly boots the OS. But before you start writing your Puppet or Chef recipes to manage your switches, a small but important side-note needs to be added: to operate the switching silicon of the Trident ASIC you need a proprietary firmware blob from Broadcom. And up until very recently, Broadcom would not give you the firmware blob unless you had some kind of agreement with them. This is why, currently, the only OSes you can install on ONIE enabled switches are commercial OSes like BigSwitch and Cumulus, whose vendors have agreements in place with the silicon vendors.
Luckily, Microsoft, Dell, Facebook, Broadcom, Intel and Mellanox have started work on a Switch Abstraction Interface (proposals), which would obviate the need for any custom firmware blobs and allow standard cross-vendor compatibility, though it remains to be seen to what degree this can completely replace proprietary firmware.
Open Compute Hardware Availability
All this Open Compute stuff might be fine and dandy, but what if your company name does not read 'Facebook' and you do not buy your servers by the datacenter? Well, it gets somewhat complicated (for smaller companies). Several of the ODMs that manufacture Open Compute gear have launched OEM subsidiaries (Quanta launched QCT, Wistron has WiWynn) who in turn operate through other sales channels. OCP gear is not available to every business however, as it is mostly Built-To-Order, often with a minimum order quantity of an entire rack. The following table summarizes some of the available Winterfell OEM/retail alternatives, next to the original OCP designs.
Specification | QCT F03C | WiWynn SV7220-2S | WiWynn SV7220-2P |
Form factor (OU) | 2 | 2 | 2 |
OCP certified | Yes | Yes | No |
Nodes per chassis | 3 | 3 | 3 |
Storage | 1x 3.5" SATA hard disk | 1x 3.5" SATA hard disk | 6x 2.5" SATA hard disk |
Max TDP | N/A | N/A | N/A |
Networking | 1x Intel 82574L GbE, 1x optional OCP Mezzanine network card | 1x Intel 82574L GbE, 1x optional OCP Mezzanine network card | 1x Intel 82574L GbE, 1x optional OCP Mezzanine network card |
BMC | AST1250 | AST1250 | AST1250 |
Expansion slots | 1x PCIe 3.0 x8 OCP mezzanine network card, 2x PCIe 3.0 x8 Full Profile card | 1x PCIe 3.0 x8 OCP mezzanine network card, 2x PCIe 3.0 x8 Low Profile card | 1x PCIe 3.0 x8 OCP mezzanine network card, 1x PCIe 3.0 x8 Low Profile card |
With Winterfell, the retail versions stay very close to the spec, with WiWynn also offering a non-certified model (the SV7220-2P) that swaps the single 3.5" disk for six 2.5" disks. The next table shows the Leopard retail versions.
Specification | QCT F06A (Rackgo X) | WiWynn SV7220G2-S |
Form factor (OU) | 2 | 2 |
OCP certified | Yes | Yes |
Nodes per chassis | 4 | 3 |
Storage | 2x 2.5" disks | 1x 3.5" SATA hard disk |
Onboard storage | mSATA | 1x mSATA/M.2 SSD |
Max TDP (W) | 135 | N/A |
Networking | 1x BMC GbE, 1x optional OCP Mezzanine network card | 1x Intel I210-AT GbE, 1x optional OCP Mezzanine network card |
BMC | AST2400 with video card | AST2400 BMC |
Expansion slots | 1x PCIe 3.0 x8 OCP mezzanine network card, 1x PCIe 3.0 x8 Low Profile card | 1x PCIe 3.0 x8 OCP mezzanine network card, 1x PCIe 3.0 x8 Low Profile card, 1x PCIe 3.0 x16 Low Profile card |
With Leopard, Quanta's QCT went a different route, by scaling the node into a quad system based on v3 of the Intel Motherboard, while WiWynn offers a standard spec version and a version with more drive bays.
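To put the node counts from the two tables above side by side, here is a minimal Python sketch that computes nodes per OpenU (OU) for each retail chassis. The chassis heights and node counts are copied from the tables; the function names and the data layout are our own illustration, not anything published by the vendors.

```python
from fractions import Fraction

# (model, platform, chassis height in OU, nodes per chassis).
# Values are taken from the Winterfell and Leopard tables in this article.
CHASSIS = [
    ("QCT F03C", "Winterfell", 2, 3),
    ("WiWynn SV7220-2S", "Winterfell", 2, 3),
    ("WiWynn SV7220-2P", "Winterfell", 2, 3),
    ("QCT F06A (Rackgo X)", "Leopard", 2, 4),
    ("WiWynn SV7220G2-S", "Leopard", 2, 3),
]


def nodes_per_ou(height_ou, nodes):
    """Return the node density of one chassis as an exact fraction."""
    return Fraction(nodes, height_ou)


def density_table(chassis=CHASSIS):
    """Map each model name to its nodes-per-OU density."""
    return {model: nodes_per_ou(height, nodes)
            for model, _platform, height, nodes in chassis}


if __name__ == "__main__":
    for model, density in density_table().items():
        print(f"{model}: {float(density):.2f} nodes per OU")
```

The only outlier is the quad-node QCT F06A, which fits four nodes where every other chassis listed fits three in the same 2 OU.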
On a more general note, in Europe, and presumably in other markets as well, the OEMs do not appear to be publicly pushing OCP equipment; rather, they are simply responding to market demand. Should OCP OEMs want to go up against Dell and HP, more will be needed than some news articles and merit indications; advertising to the 'IT decision makers' might be welcome.
Finally, if you do happen to be a large-volume customer in the United States, you have multiple vendors to choose from and get support from. Even HP will sell you OCP gear when the total price tag reaches interesting levels.
Open Compute Compliance and Interop
With multiple vendors of Open Compute hardware popping up, the logical next step was to make sure they correctly implemented the specification. Two labels were created to indicate what sort of compatibility testing was performed: OCP Ready and OCP Certified. The OCP Ready sticker is free to use by vendors to indicate that their equipment complies with the spec and that it is able to work within an OCP environment. The OCP Certified label, however, is only issued by approved testing facilities, of which there are just two: one located within the University of Texas at San Antonio (UTSA), another in Taiwan at the Industrial Technology Research Institute (ITRI). The first vendors to become OCP certified were WiWynn and Quanta QCT; meanwhile, AMD and Intel have also received certification for their reference systems.
The OCP Certification Center (ITRI) has published the certification specifications used for Leopard, Winterfell and Knox, along with the test kits, allowing device owners to run the tests themselves. The test specifications itemize the essential features of the hardware and describe the pass or fail condition for each individual test. Because the test kits are open, OCP vendors can first pass all tests in their own facility, then ship the hardware to the validation center, pay for validation testing, and leave with OCP certified gear.
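The structure described here, a list of itemized features each with its own pass or fail condition, can be modeled in a few lines of Python. This is only an illustrative sketch: the check names, the device dictionary and the expected values (borrowed from the Winterfell table earlier in this article) are our own assumptions and are not the actual items in the published OCP test kits.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CheckItem:
    """One itemized feature with its pass/fail condition (hypothetical)."""
    name: str
    passes: Callable[[Dict[str, object]], bool]


# Hypothetical example items, NOT taken from the real certification spec.
EXAMPLE_ITEMS: List[CheckItem] = [
    CheckItem("BMC model", lambda dev: dev.get("bmc") == "AST1250"),
    CheckItem("nodes per chassis", lambda dev: dev.get("nodes_per_chassis") == 3),
]


def run_checklist(items, device):
    """Evaluate every item against the device and record pass (True) or fail."""
    return {item.name: item.passes(device) for item in items}


def all_passed(results):
    """A device is ready for validation only if every single test passes."""
    return all(results.values())


if __name__ == "__main__":
    device = {"bmc": "AST1250", "nodes_per_chassis": 3}
    results = run_checklist(EXAMPLE_ITEMS, device)
    for name, ok in results.items():
        print(f"{name}: {'PASS' if ok else 'FAIL'}")
    print("ready for validation:", all_passed(results))
```

A vendor running the open test kits in-house is effectively doing this dry run before paying for the official validation pass.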
How The Open Compute Project Will Impact Your Datacenter
Facebook's initiative caused quite a ripple among the large-volume datacenter equipment vendors. With Microsoft joining as a second major contributor by way of donating its Open Cloud Server suite, the Open Compute Project has gained substantial momentum. So far, a large part of the OCP contributions has come from parties that designed hardware for themselves in order to operate more efficiently, but are not hardware vendors, like Facebook and Microsoft.
Meanwhile others have contributed information on various topics such as Shingled Magnetic Recording disks and how racks are tested for rigidity and stability. And a next step for the vendor community would be to bring more openness to bog-standard parts (hard drives, network controllers, ...).
The result of OCP is that innovative ideas in hardware and datacenter design are quickly being tested in the real world and ultimately standardized. This is in significant contrast to the "non-OCP world", where brilliant ideas mostly remain PowerPoint slides and only materialize as proprietary solutions for those with deep pockets. The Open Rack innovation is a perfect example of this: we have seen lots of presentations of rack-consolidated power supply and cooling, but the innovative solution was only available as closed, expensive proprietary systems with lots of limitations and vendor lock-in. Thanks to OCP, this kind of innovation not only becomes available to a much larger public, but we also get compatible, standardized products (i.e. servers that can plug into the power shelves of those racks) from multiple vendors.
P.S. Note from Johan: Wannes De Smet is my colleague at the Sizing Servers Lab (Howest), who has done a lot of research work around OpenCompute.