Google is using machine learning and artificial intelligence to wring even more efficiency out of its mighty data centers.
In a presentation today at Data Centers Europe 2014, Google’s Joe Kava said the company has begun using a neural network to analyze the oceans of data it collects about its server farms and to recommend ways to improve them. Kava is the Internet giant’s vice president of data centers.
In effect, Google has built a computer that knows more about its data centers than even the company’s engineers. The humans remain in charge, but Kava said the use of neural networks will allow Google to reach new frontiers in efficiency in its server farms, moving beyond what its engineers can see and analyze.
Google already operates some of the most efficient data centers on earth. Using artificial intelligence will allow Google to peer into the future and model how its data centers will perform in thousands of scenarios.
In early usage, the neural network has been able to predict Google’s Power Usage Effectiveness with 99.6 percent accuracy. Its recommendations have led to efficiency gains that appear small, but can lead to major cost savings when applied across a data center housing tens of thousands of servers.
Why turn to machine learning and neural networks? The primary reason is the growing complexity of data centers, a challenge for Google, which uses sensors to collect hundreds of millions of data points about its infrastructure and its energy use.
“In a dynamic environment like a data center, it can be difficult for humans to see how all of the variables interact with each other,” said Kava. “We’ve been at this (data center optimization) for a long time. All of the obvious best practices have already been implemented, and you really have to look beyond that.”
Enter Google’s ‘Boy Genius’
Google’s neural network was created by Jim Gao, an engineer whose colleagues have given him the nickname “Boy Genius” for his prowess analyzing large datasets. Gao had been doing cooling analysis using computational fluid dynamics, which uses monitoring data to create a 3D model of airflow within a server room.
Gao thought it was possible to create a model that tracks a broader set of variables, including IT load, weather conditions, and the operations of the cooling towers, water pumps and heat exchangers that keep Google’s servers cool.
“One thing computers are good at is seeing the underlying story in the data, so Jim took the information we gather in the course of our daily operations and ran it through a model to help make sense of complex interactions that his team – being mere mortals – may not otherwise have noticed,” Kava said in a blog post. “After some trial and error, Jim’s models are now 99.6 percent accurate in predicting PUE. This means he can use the models to come up with new ways to squeeze more efficiency out of our operations. ”
Gao began working on the machine learning initiative as a “20 percent project,” a Google tradition of allowing employees to spend a chunk of their work time exploring innovations beyond their specific work duties. Gao wasn’t yet an expert in artificial intelligence. To learn the fine points of machine learning, he took a course from Stanford University Professor Andrew Ng.
Neural networks mimic how the human brain works, allowing computers to adapt and “learn” tasks without being explicitly programmed for them. Google’s search engine is often cited as an example of this type of machine learning, which is also a key research focus at the company.
“The model is nothing more than series of differential calculus equations,” Kava explained. “But you need to understand the math. The model begins to learn about the interactions between these variables.”
Gao’s first task was crunching the numbers to identify the factors that had the largest impact on energy efficiency of Google’s data centers, as measured by PUE. He narrowed the list down to 19 variables and then designed the neural network, a machine learning system that can analyze large datasets to recognize patterns.
“The sheer number of possible equipment combinations and their setpoint values makes it difficult to determine where the optimal efficiency lies,” Gao writes in the white paper on his initiative. “In a live DC, it is possible to meet the target setpoints through many possible combinations of hardware (mechanical and electrical equipment) and software (control strategies and setpoints). Testing each and every feature combination to maximize efficiency would be unfeasible given time constraints, frequent fluctuations in the IT load and weather conditions, as well as the need to maintain a stable DC environment.”