Original Link: https://www.anandtech.com/show/7723/free-cooling-the-server-side-of-the-story



Data centers are the massive engines under the hood of the mobile internet economy. And it is no secret that they demand a lot of energy: with power capacities ranging from 10MW to 100MW, a large data center can draw up to 80,000 times more power than a typical US home.

And yet, you do not have to be a genius to figure out how the enormous energy bills could be reduced. The main energy gobblers are the CRACs (Computer Room Air Conditioners) or their alternative, the CRAHs (Computer Room Air Handlers). Most data centers still rely on some form of mechanical cooling, and to an outsider it looks pretty wasteful, even stupid, that a data center consumes energy to cool servers down while the outside air in a mild climate is more than cold enough most of the time (less than 20°C/68°F).

Free cooling

There are quite a few data centers that have embraced "free cooling" completely, i.e. cooling with outside air. Microsoft's data center in Dublin uses large air-side economizers and makes good use of the lower temperature of the outside air.

Microsoft's data center in Dublin: free cooling with air economizers (source: Microsoft)

The air-side economizers bring outside air into the building and distribute it via a series of dampers and fans; hot air is simply flushed outside. As mechanical cooling typically accounts for 40-50% of a traditional data center's energy consumption, it is clear that enormous energy savings are possible with free cooling.

Air economizers in the data center

This is easy to illustrate with the most important - although far from perfect - benchmark for data centers: PUE, or Power Usage Effectiveness. PUE is simply the ratio of the total amount of energy consumed by the data center as a whole to the energy consumed by the IT equipment. Ideally it is 1, which means that all energy goes to the IT equipment. Most data centers that host third-party IT equipment are in the range of 1.4 to 2. In other words, for each watt consumed by the servers/storage/network equipment, another 0.4 to 1 watt is needed for cooling, ventilation, UPS, power conversion and so on.
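To make that ratio concrete, here is a minimal sketch in Python (the facility numbers are made up for illustration, not measurements from any specific data center):

    def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
        """Power Usage Effectiveness: total facility power divided by IT power."""
        return total_facility_kw / it_equipment_kw

    # Hypothetical facility with 1000 kW of IT load and varying overhead
    # (cooling, ventilation, UPS, power conversion) per watt of IT power.
    it_load_kw = 1000.0
    for overhead_w_per_w in (0.15, 0.4, 1.0):
        total_kw = it_load_kw * (1 + overhead_w_per_w)
        print(f"{overhead_w_per_w:.2f} W overhead per IT watt -> PUE {pue(total_kw, it_load_kw):.2f}")

    # 0.15 W overhead per IT watt -> PUE 1.15
    # 0.40 W overhead per IT watt -> PUE 1.40
    # 1.00 W overhead per IT watt -> PUE 2.00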

The "single-tenant" data centers of Facebook, Google, Microsoft and Yahoo that use free cooling to its full potential are able to achieve an astonishing PUE of 1.15-1.2. You can imagine that the internet giants save massive amounts of energy this way. But as you might have guessed, most enterprises and "multi-tenant" data centers cannot simply copy the data center technologies of the internet giants. According to a survey of more than 500 data centers conducted by the Uptime Institute, the average PUE rating is 1.8, so there is still a lot of room for improvement.

Let's see what the hurdles are and how buying the right servers could lead to much more efficient data centers and ultimately an Internet that requires much less energy.



Hurdles for Free Cooling

It is indeed a lot easier for Facebook, Google and Microsoft to operate data centers with "free cooling". After all, the servers inside those data centers are basically "expendable"; there is no need to make sure that an individual server does not fail. The applications running on top of those servers can handle an occasional server failure easily. That is in sharp contrast with a data center that hosts servers of hundreds of different customers, where the availability of a small server cluster is of the utmost importance and regulated by an SLA (Service Level Agreement). The internet giants also have full control over both facilities and IT equipment.

There are other concerns and humidity is one of the most important ones. Too much humidity and your equipment is threatened by condensation. Conversely, if the data center air is too dry, electrostatic discharge can wreak havoc.

Still, the humidity of the outside air does not have to be a showstopper for free cooling, as many data centers can be outfitted with a water-side economizer instead. Cold water replaces the refrigerant, and pumps and a closed circuit replace the compressor. The hot return water passes through the outdoor pipes of the heat exchangers, and if the outdoor air is cold enough, the water-side system can cool the water back down to the desired temperature.

Google's data center in Belgium uses water-side cooling so well that it does not need any additional cooling. (source: Google)

Most "free cooling" systems are "assisting" cooling systems: in many situations they cannot guarantee the typical 20-25°C (68-77°F) inlet temperature that CRACs can offer the whole year round.

All you need is ... a mild climate

But do we really need to guarantee a rather low 20-25°C inlet temperature for our IT equipment all year round? It is an important question, as large parts of the world could rely on free cooling if the server inlet temperature does not need to be that low.

The Green Grid, a non-profit organization, uses data from the Weatherbank to calculate the number of hours per year that a data center can use air-side free cooling to keep the inlet temperature below 35°C. To make this more visual, they publish the data as a colored map. Dark blue means that air-side economizers can be effective for 8500 hours per year, which is basically year round. Here is the map of North America:

About 75% of North America can use free cooling if the maximum inlet temperature is raised to 35°C (95°F). In Europe, the situation is even better:

Although I have my doubts about the accuracy of the map (the south of Spain and Greece see a lot more hot days than the south of Ireland), it looks like 99% of Europe can make use of free cooling. So how do our current servers cope with an inlet temperature of up to 35°C?
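The calculation behind such a map is conceptually simple: count the hours per year during which the outside (dry bulb) temperature stays at or below the allowed inlet temperature. A minimal sketch, assuming you already have a list of 8760 hourly temperature readings for a location (the loading function is a hypothetical placeholder):

    def free_cooling_hours(hourly_temps_c, max_inlet_c=35.0):
        """Count the hours in which outside air alone keeps the inlet at or
        below the allowed temperature. (The real Green Grid methodology also
        takes humidity into account; this only looks at dry bulb temperature.)"""
        return sum(1 for t in hourly_temps_c if t <= max_inlet_c)

    # Example usage with a fictional dataset:
    # hourly_temps_c = load_hourly_weather("brussels_2013.csv")  # hypothetical helper, 8760 values
    # print(free_cooling_hours(hourly_temps_c, max_inlet_c=35.0))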



The Server CPU Temperatures

Given Intel's dominance in the server market, we will focus on the Xeons. The "normal", non-low-power Xeons have a specified Tcase of 75°C (167°F, 95W models) to 88°C (190°F, 130W models). Tcase is the temperature measured by a thermocouple embedded in the center of the heat spreader, so there is a lot of thermal headroom. The low power Xeons (70W TDP or less) have much less headroom, as their Tcase is a rather low 65°C (149°F); but since those Xeons produce a lot less heat, it should also be easier to keep them at lower temperatures.
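As a back-of-the-envelope illustration, the thermal headroom is simply the specified Tcase minus the temperature measured at the heat spreader. The Tcase values below are the spec numbers quoted above; the measured temperatures are made up:

    # Specified Tcase values quoted above (degrees Celsius)
    TCASE_C = {
        "130W Xeon": 88,
        "95W Xeon": 75,
        "low-power Xeon (<=70W)": 65,
    }

    def headroom_c(sku: str, measured_heatspreader_c: float) -> float:
        """Margin left before the CPU reaches its specified Tcase."""
        return TCASE_C[sku] - measured_heatspreader_c

    print(headroom_c("130W Xeon", 72.0))               # 16.0 degrees of margin
    print(headroom_c("low-power Xeon (<=70W)", 58.0))  # 7.0 degrees of margin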

But there is more than the CPU of course; the complete server must be able to run at higher temperatures. That is where the ASHRAE specifications come in. The American Society of Heating, Refrigerating and Air-Conditioning Engineers publishes guidelines for the temperature and humidity operating ranges of IT equipment. If vendors comply with these guidelines, administrators can be sure that they will not void warranties when running servers at higher temperatures. Most vendors - including HP and Dell - now allow the inlet temperature of a server to be as high as 35°C, the so-called A2 class.

ASHRAE specifications per class

The specified temperature is the so-called "dry bulb" temperature, i.e. the temperature measured by a regular (dry) thermometer. Humidity should be roughly between 20 and 80%. Specially equipped servers (Class A4) can go as high as 45°C, with humidity between 10 and 90%.
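As a quick reference, here is a simplified sketch of those two envelopes as quoted above (the real ASHRAE tables also specify lower temperature bounds and dew point limits, which are omitted here):

    # Simplified allowable envelopes as quoted in the text:
    # maximum dry bulb temperature (°C) and relative humidity range (%)
    ASHRAE_CLASSES = {
        "A2": {"max_dry_bulb_c": 35, "rh_range": (20, 80)},
        "A4": {"max_dry_bulb_c": 45, "rh_range": (10, 90)},
    }

    def within_class(cls: str, dry_bulb_c: float, rh_pct: float) -> bool:
        """Check whether an inlet condition falls inside the (simplified) envelope."""
        spec = ASHRAE_CLASSES[cls]
        rh_lo, rh_hi = spec["rh_range"]
        return dry_bulb_c <= spec["max_dry_bulb_c"] and rh_lo <= rh_pct <= rh_hi

    print(within_class("A2", 33, 55))  # True: fine for any A2-compliant server
    print(within_class("A2", 41, 55))  # False: too hot for A2, needs A4-class hardware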

It is hard to overestimate the impact of servers being capable of breathing hotter air. In modern data centers this ability could be the difference between being able to depend on free cooling alone and having to keep investing in very expensive chiller installations. Being able to use free cooling comes with both OPEX and CAPEX savings. In traditional data centers, it allows administrators to raise the room temperature and decrease the amount of energy the cooling requires.

And last but not least, it increases the time before a complete shutdown is necessary when the cooling installation fails. The more headroom you get, the easier it is to fix the cooling problems before critical temperatures are reached and the reputation of the hosting provider is tarnished. In a modern data center, it is almost the only way to run most of the year with free cooling.

Raising the inlet temperature is not easy when you are providing hosting for many customers (i.e. a "multi-tenant" data center). Most customers resist warmer data centers, with good reason in some cases: we watched a 1U server burn 80W on its fans out of a total power draw of less than 200W! In such a case, the savings of the data center facility are paid for by the energy losses of the IT equipment. That is great for the data center's PUE, but not very compelling for the customers.

But how about the latest servers that support much higher inlet temperatures? Supermicro claims its servers can work with inlet temperatures of up to 47°C. It's time to do what AnandTech does best and give you facts and figures so you can decide whether higher temperatures are viable.



The Supermicro "PUE-Optimized" Server

We tested the Supermicro Superserver 6027R-73DARF. We chose this particular server for two main reasons: first, it is a 2U rackmount server (larger fans, better airflow), and second, it was the only PUE-optimized server with 16 DIMM slots. Many applications are limited by memory capacity rather than CPU, so a 16-DIMM server is more interesting to most of our readers than an 8-DIMM one.

On the outside, it looks like most other Supermicro servers, with the exception that the upper third of the front is left open for better airflow. This is in contrast with some Supermicro servers where the upper third is filled with disk bays.

This superserver has a few features to ensure that it can cope with higher temperatures without a huge increase in energy consumption. First of all, it has an 80 Plus Platinum power supply. A Platinum PSU is not exceptional anymore: almost every server vendor offers at least the slightly less efficient 80 Plus Gold PSUs. Platinum PSUs are becoming the standard for new servers, and Dell and Supermicro have even started offering 80 Plus Titanium PSUs (230V).

Nevertheless, these Platinum PSUs are pretty impressive: they offer better than 92% efficiency from 20% to 100% load.

Secondly, it uses a spread-core design: the CPU heatsinks do not obstruct each other, as the airflow passes over them in parallel.

Three heavy-duty fans blow over a relatively simple motherboard design. Notice that the heatsink on the 8W Intel PCH (C602J chipset) is also placed in parallel with the CPU heatsinks, so it gets an unhindered airflow. Last but not least, these servers come with specially designed air shrouds for maximum cooling.

There is some room for improvement though. It would be great to have a model with 2.5-inch drive bays. Supermicro offers a 2.5-inch HDD conversion tray (MCP-220-00043-0N), but a native 2.5-inch drive bay model would give even better airflow and serviceability.

We would also like an easier way to replace the CPUs, as the screws of the heatsinks tend to wear out quickly. But that is mostly a problem for a lab that tests servers, less so for a real enterprise.



How We Tested

To determine the optimal point between data center temperature and system cooling performance, we created a controlled temperature testing environment, called a "HotBox". Basically, we placed a server inside an insulated box. The box consists of two main layers: at the bottom is the air inlet where a heating element is placed. The hot air is blown inside the box and is then sucked into the front of the server on the second layer. This way we can simulate inlet air that comes from below, as in most data centers. Inlet and outlet are separated and insulated from each other, simulating the hot and cold aisles. Two thermistors measure the inlet temperature, one on the right and one on the left, just behind the front panel.

Just behind the motherboard, close to the back of the server, a pair of thermistors monitors the outlet temperature. And we'd like to thank Wannes De Smet, who designed the HotBox!

The server is fed by a standard European 230V (16A max) power line. We use the Racktivity ES1008 Energy Switch PDU to measure power consumption. The measurement circuits of most PDUs assume that the incoming AC is a perfect sine wave, which it never is. The Racktivity PDU, however, measures true RMS current and voltage at a very high sample rate: up to 20,000 measurements per second for the complete PDU.
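The distinction matters because switching power supplies draw a distinctly non-sinusoidal current. Here is a minimal sketch of what "true RMS" means when computed from raw instantaneous samples (purely illustrative; the Racktivity firmware does this internally):

    import math

    def true_rms(samples):
        """Root-mean-square of instantaneous samples; valid for any waveform,
        not just a perfect sine."""
        return math.sqrt(sum(x * x for x in samples) / len(samples))

    # Sanity check: one second of a perfect 230V/50Hz sine sampled at 20 kS/s
    # should give an RMS of peak / sqrt(2) = 230V.
    peak_v = 230 * math.sqrt(2)
    sine = [peak_v * math.sin(2 * math.pi * 50 * n / 20000) for n in range(20000)]
    print(round(true_rms(sine)))  # ~230

    # A distorted current waveform (clipped peaks, harmonics) yields a different RMS
    # than the "assume a perfect sine" shortcut used by cheaper measurement circuits.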

Datamining on Hardware

Building the "HotBox" was one thing; getting all the necessary data is another serious challenge. A home-made PCB collects the data from the thermistors. Our vApus stress testing software interfaces with ESXi to collect hardware usage counters and temperatures; fan speeds are collected from the BMC; and power numbers come from the Racktivity PDU. All of this happens while a realistic load is placed on the ESXi virtual machines. The excellent programming work of Dieter of the Sizing Servers Lab resulted in a large amount of data in our Excel sheets.
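To give an idea of what that data pipeline does, here is a heavily simplified sketch of such a polling loop. All sensor helpers below are hypothetical stand-ins returning dummy values; the real setup reads the thermistor PCB, ESXi, the BMC and the Racktivity PDU:

    import csv
    import random
    import time

    # Hypothetical stand-ins for the real data sources.
    def read_thermistor(position):  return 35.0 + random.uniform(-0.5, 0.5)    # °C
    def read_esxi_cpu_temp():       return 70.0 + random.uniform(-3.0, 3.0)    # °C
    def read_bmc_fan_rpm():         return 5000 + random.randint(-200, 200)    # RPM
    def read_pdu_power():           return 250.0 + random.uniform(-10.0, 10.0) # W

    def log_run(path="hotbox_run.csv", duration_s=1000, interval_s=5):
        """Sample every sensor at a fixed interval and write one CSV row per sample."""
        fields = ["t_s", "inlet_c", "outlet_c", "cpu_c", "fan_rpm", "power_w"]
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fields)
            writer.writeheader()
            start = time.time()
            while time.time() - start < duration_s:
                writer.writerow({
                    "t_s": round(time.time() - start, 1),
                    "inlet_c": read_thermistor("inlet"),
                    "outlet_c": read_thermistor("outlet"),
                    "cpu_c": read_esxi_cpu_temp(),
                    "fan_rpm": read_bmc_fan_rpm(),
                    "power_w": read_pdu_power(),
                })
                time.sleep(interval_s)

    if __name__ == "__main__":
        log_run(duration_s=30, interval_s=5)  # short demo run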

To put a realistic load on the machine we use our own real-life load generator called vApus. With vApus we capture real user interaction with a website, add some parameters that can be randomized, and then replay that log a number of times.

The workload consists of four VMs:

  • Drupal LAMP VM running sizingservers.be website
  • Zimbra 8 VM
  • phpBB LAMP VM running clone of real website
  • OLAP (news aggregator database)

The Drupal site gets regular site visitors mixed with the posting of new blog entries and sending email, resulting in a moderate system load. The Zimbra load is disk-intensive, consisting of users creating and sending emails, replying, creating appointments, tasks and contacts. The phpBB workload has a moderate CPU and network load, viewing and creating forum threads with rich content. Finally, the OLAP workload is based on queries from a news aggregator and is mostly CPU bound. These four VMs form one tile (similar to VMmark tiles). We ran two tiles in each test, resulting in a load of 10% to 80%.



Benchmark Configuration

Since Supermicro claims that these servers are capable of operating at inlet temperatures of 47°C (117°F) while supporting Xeons with 135W TDPs, we tested with two extreme processors. First off is the Xeon E5-2650L at 1.8GHz with a low 70W TDP and a very low Tcase of 65°C: it consumes little power but is quite sensitive to high temperatures. Second, we tested with the fastest Xeon E5 available: the Xeon E5-2697 v2. Its TDP is 130W for 12 cores at 2.7GHz, and its Tcase is 86°C. This is a CPU that needs a lot of power, but it is also more tolerant of high temperatures.

Supermicro 6027R-73DARF (2U Chassis)

CPU: Two Intel Xeon E5-2697 v2 (2.7GHz, 12 cores, 30MB L3, 130W)
     or two Intel Xeon E5-2650L v2 (1.7GHz, 10 cores, 25MB L3, 70W)
RAM: 64GB (8x8GB) DDR3-1600 Samsung M393B1K70DH0-CK0
Internal Disks: 8GB flash disk to boot, 1 GbE link to iSCSI SAN
Motherboard: Supermicro X9DRD-7LN4F
Chipset: Intel C602J
BIOS version: R 3.0a (December 6, 2013)
PSU: Supermicro 740W PWS-741P-1R (80 Plus Platinum)

All C-states are enabled in both the BIOS and ESXi.



Loading the Server

The server first gets a few warm-up runs and then we measure over a period of about 1000 seconds. The blue lines represent the measurements done with the Xeon E5-2650L, the orange/red lines the Xeon E5-2697 v2. We test with three settings:

  • No heating. Inlet temperature is about 20-21°C, regulated by the CRAC
  • Moderate heating. We regulate until the inlet temperature is about 35°C
  • Heavy heating. We regulate until the inlet temperature is about 40°C

First we start with a stress test: what kind of CPU load do we attain? Our objective is to test a realistic load for a virtualized host, between 20% and 80% CPU load. Peaks above 80% are acceptable, but long periods of 100% CPU load are not.

There are some small variations between the different tests, but the load curve is very similar on the same CPU. The 2.7GHz 12-core Xeon E5-2697 v2 has a CPU load between 1% and 78%. During peak load, the load is between 40% and 80%.

The 8-core 1.8GHz Xeon E5-2650L is not as powerful and has a peak load of 50% to 94%. Let's check out the temperatures. The challenge is to keep the CPU temperature below the specified Tcase.

The low power Xeon stays well below the specified Tcase. Despite the fact that it starts at 55°C when the inlet is set to 40°C, the CPU never reaches 60°C.

The results on our 12-core monster are a different matter. With an inlet temperature of up to 35°C, the server is capable of keeping the CPU below 75°C (see the red line). When we increase the inlet temperature to 40°C, the CPU starts at 61°C and quickly rises to 80°C. We measured peaks of 85°C, which is very close to the specified 86°C maximum. Those values are acceptable, but at first sight it seems there is little headroom left.

The most extreme case would be to fill up all disk bays and DIMM slots and to set inlet temperature to 45°C. Our heating element is not capable of sustaining an inlet of 45°C, but we can get an idea of what would happen by measuring how hard the fans are spinning.



Fan Speed

How do these higher temperatures affect the fans?

It is clear that the fan speed algorithm takes more than just the CPU temperature and inlet temperature into account; it clearly detects when a low-power CPU with a low Tcase is installed. As a result, the fans spin faster with the E5-2650L than with the E5-2697 v2. That also means the server has more headroom with the Xeon E5-2697 v2 than we first assumed based on the CPU temperature results: at higher inlet temperatures, the fans can still go a bit faster if necessary, as the maximum fan speed is 7000 RPM.

Power Measurements

The big question of course is how all this affects the power bill. There is no point in saving on cooling if your server simply consumes a lot more power due to increased fan speeds (and potentially more downtime from replacing fans more frequently).

The difference in power consumed between the three inlet temperatures is not large. To make our measurements clear, we standardized on the measurements at 20°C as the baseline and created the following table:

CPU load       Xeon E5-2697 v2 (inlet)          Xeon E5-2650L (inlet)
               20°C     35°C     40°C           20°C     35°C     40°C
0-10%          100%     105%     106%           100%     106%     112%
10-20%         100%      98%     103%           100%     104%     108%
20-30%         100%     105%     110%           100%     103%     107%
30-40%         100%     102%     105%           100%     102%     105%
40-50%         100%     109%     108%           100%      97%     109%
50-60%         100%     105%     108%           100%     108%     111%
60-70%         100%     106%     107%           100%     104%     110%
70-80%         100%     106%     104%           100%     105%     109%
80-90%         N/A      N/A      N/A            100%     109%     108%
Average                 105%     107%                    103%     109%

As the fans have to work quite a bit harder to keep the 2650L below its low Tcase, they need a lot more power: we notice a 9% increase in power when the inlet temperature rises from 20°C to 40°C. The increase is smaller with the Xeon E5-2697 v2, only 7%.

The most interesting conclusion is that raising the inlet temperature from 20 to 35°C results in almost no increase in power consumption (3-5%) on the server side, while the savings on cooling and ventilation can be substantial, around 40% or more.
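For reference, the normalization in the table above is nothing more than dividing each power reading by the 20°C baseline of the same CPU-load bucket. A minimal sketch with made-up wattages (not our actual measurements):

    def normalize_to_baseline(readings_w, baseline_w):
        """Express power readings as a percentage of the 20°C baseline,
        bucket by bucket (same CPU-load range)."""
        return [round(100 * r / b) for r, b in zip(readings_w, baseline_w)]

    # Hypothetical average power (W) per CPU-load bucket:
    baseline_20c = [210, 230, 255, 280]   # inlet at 20°C
    inlet_40c    = [222, 246, 273, 294]   # same buckets, inlet at 40°C
    print(normalize_to_baseline(inlet_40c, baseline_20c))  # [106, 107, 107, 105]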



DIMM Temperatures

Besides the CPU, we also wanted to know how the other components coped with the higher temperatures. First are the Samsung RDIMMs.

DIMMs can operate at up to 95°C, so all the measurements seem to be quite safe.

PCH Temperatures

How about the chipset?

 

The 8W Intel C602J is specified to work at up to 92°C, so again there is still a lot of headroom left, even with 40°C inlet air. Notice that the CPU used still has an impact on the temperature of the PCH, despite the fact that there is quite a bit of space between the PCH and the CPU heatsinks: the higher performance of the 12-core CPUs makes the chipset work harder as well.



Performance?

Yes, we did monitor performance. But it simply was not worth talking about: the results at 20°C inlet are almost identical to those at 40°C inlet. The only difference that lower temperatures could make is a slight increase in the amount of time spent at higher Turbo Boost frequencies, but we could not measure any significant difference. The reason is of course that some of our VMs are also somewhat disk intensive.

Conclusion

The PUE-optimized servers can sustain up to 40°C inlet temperature without a tangible increase in power consumption. That may not sound spectacular, but it definitely is. The "PUE-optimized" servers are simply improved versions of regular servers; they do not need any expensive technology to sustain high inlet temperatures. As a result, the Supermicro Superserver 6027R-73DARF costs only around $1300.

That means that even an older data center can save a massive amount of money by simply making sure that some sections only contain servers that can cope with higher inlet temperatures. An investment in air-side or water-side economizers could result in very large OPEX savings.

Reliability was beyond the scope of this article and the budget of our lab. But previous studies, for example by IBM and Google, have also shown that reasonably high inlet temperatures (lower than 40°C) have no significant effect on the reliability of the electronics.

Modern data centers should avoid servers that cannot cope with higher inlet temperatures at all cost, as the cost savings of free cooling range from significant to enormous. We quote a study done on a real-world data center by Intel:

"67% estimated power savings using the (air) economizer 91% of the time—an estimated annual savings of approximately USD 2.87 million in a 10MW data center"

A simple, solid and very affordable server without frills that allows you to lower the cooling costs is a very good deal. 
