
AI chips' power demands push data centers to get creative


The computer chips powering your ChatGPT queries are roughly six times more power-hungry than the chips that dominated data centers just a few years ago. As AI pushes individual chips to consume ever more electricity, data centers are racing to squeeze more computing out of every watt, and the physics of keeping silicon cool may determine whether AI's growth is sustainable.

Data center operators and researchers are scrambling for efficiency gains as grid constraints make new power connections increasingly hard to secure. But these incremental improvements (liquid-cooled power systems, smarter maintenance schedules, higher-voltage power distribution) are up against fundamental physics and economic incentives that push sustainability down the priority list. The question is whether incremental wins can keep up with artificial intelligence's appetite for electricity.

The stakes for every server rack keep rising. Data centers already consume about 4% of U.S. grid power, and that figure is expected to reach 9% within the next decade. In hot markets like Virginia and Texas, utilities are so swamped with requests for new data center connections that they charge millions of dollars just to study whether the grid can handle the load.

That creates new urgency around an old metric: power usage effectiveness, or PUE, which measures how much of a facility's power actually reaches the computers and how much is lost to cooling and other overhead. The math is simple, but the margins are tight. A data center running at a PUE of 1.5 delivers only about 67% of its incoming power to actual computation; the rest disappears into cooling systems and power-conversion losses. Even small improvements can yield significant energy and cost savings, said Ryan Mallory, president and chief operating officer of data center operator Flexential.
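The PUE arithmetic is easy to check. Here is a minimal back-of-the-envelope sketch in Python (illustrative only, not a tool from Flexential or anyone quoted here):

```python
# PUE = total facility power / power delivered to IT equipment,
# so the share of power doing useful computing is simply 1 / PUE.

def it_power_fraction(pue: float) -> float:
    """Fraction of incoming power that reaches the computing equipment."""
    return 1.0 / pue

for pue in (1.5, 1.4, 1.3, 1.2, 1.1):
    print(f"PUE {pue}: {it_power_fraction(pue):.1%} of power reaches the IT load")

# PUE 1.5 -> 66.7%, matching the roughly 67% figure above.
```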

“We’re talking about tenths of a percentage point,” Mallory said. “But it’s very impactful for operating costs. If you move by a tenth of a point, say you drop from 1.4 to 1.3, you’re probably gaining $50,000 per month of efficiency per megawatt of power consumption.”

For a large facility, that adds up fast. One Flexential client operates a 27-megawatt AI facility, where a 0.1 PUE improvement would save $1.35 million per month, or more than $16 million per year. Just as important, the same efficiency gain lets the facility pack more computing power into the same grid connection, which matters when new power connections can take years to approve and cost millions of dollars just to study.
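Those figures are consistent with Mallory's rule of thumb. A quick sketch of the arithmetic (the $50,000 figure is his estimate, applied here to the 27-megawatt facility):

```python
# Mallory's estimate: ~$50,000/month saved per megawatt for each
# 0.1 improvement in PUE.
savings_per_mw_per_month = 50_000  # USD, per 0.1 PUE improvement
facility_mw = 27                   # the client facility cited above

monthly = savings_per_mw_per_month * facility_mw
print(f"Monthly savings: ${monthly:,}")       # $1,350,000
print(f"Annual savings:  ${monthly * 12:,}")  # $16,200,000
```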

Those gains matter even more given the scale of data center construction now underway, as tracked by real estate firm CBRE. Across that growing footprint, even modest efficiency improvements translate into more AI computing power without adding strain to an already overwhelmed grid.

Cooling crisis

The path to these PUE gains usually comes down to physics and planning. Hyperscale operators such as Google and Meta can hit PUE ratings of 1.1 or 1.2 because their server farms use identical equipment arranged in predictable patterns, producing consistent airflow. Most data centers, though, host a mix of customers whose varied hardware creates what Mallory calls “chaotic airflow patterns and hotspots,” making efficient cooling far harder to achieve.

Yet no matter how perfect the layout, every data center is fighting heat.

Operators are getting creative about managing temperature. Mallory's company schedules equipment maintenance for cool morning hours to avoid the energy penalty of running tests in peak heat. In hot climates such as Las Vegas and Phoenix, facilities use evaporative cooling systems that pre-cool air before it enters the main cooling system, much like the misters at outdoor restaurants. Some facilities even tap “free air cooling” in winter, opening vents to use cold outside air directly.

To handle enormous power loads more efficiently, data centers are also upgrading their electrical systems. Traditional data centers use low-voltage power distribution, but AI racks now demand higher-voltage systems, and some operators plan to jump to 400 or even 800 volts.

Higher voltage means lower current for the same power delivery, which cuts the resistive losses that turn precious electricity into unwanted heat. It's a two-for-one win: less wasted energy and less heat to remove. But even these upgrades don't solve the basic problem of a rack that generates as much heat as a space heater squeezed into a closet-sized footprint.
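The underlying physics is plain Ohm's-law arithmetic. The sketch below uses illustrative numbers (the rack power and cable resistance are assumptions, not figures from the article) to show how conduction loss falls as distribution voltage rises:

```python
# For fixed power delivery P = V * I, raising V lowers I,
# and conduction loss in the cabling scales as P_loss = I^2 * R.

def resistive_loss(power_w: float, voltage_v: float, resistance_ohm: float) -> float:
    current_a = power_w / voltage_v           # I = P / V
    return current_a ** 2 * resistance_ohm    # P_loss = I^2 * R

POWER_W = 100_000   # a 100 kW rack (illustrative assumption)
R_OHM = 0.01        # 10 milliohms of distribution path (illustrative assumption)

for volts in (208, 400, 800):
    loss = resistive_loss(POWER_W, volts, R_OHM)
    print(f"{volts:>3} V: {loss:,.0f} W lost ({loss / POWER_W:.2%} of delivered power)")

# Doubling the voltage halves the current and quarters the I^2 * R loss.
```

Because the loss scales with the square of the current, each doubling of voltage cuts resistive waste by a factor of four, which is why the jump from today's low-voltage distribution to 400 or 800 volts is attractive.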

Truly solving the thermal problem requires more radical measures. That's why TE Connectivity and other companies have developed liquid-cooled power distribution systems (essentially water-cooled cables) that can carry more power in the same footprint as traditional systems while removing heat more efficiently.

About 30% of new data centers are being built with liquid cooling systems, a share expected to reach 50% within two to three years, said Ganesh Srinivasan, vice president of digital data networking operations at TE Connectivity.

But liquid cooling brings its own sustainability challenges: data centers can consume millions of gallons of water a year for cooling, straining local water supplies. Some facilities are experimenting with immersion cooling (literally submerging entire servers in mineral oil), which eliminates water use entirely, though the logistics have so far made it impractical for most applications.

Unexpected consequences of efficiency

Beyond infrastructure improvements, chipmakers are pursuing efficiency gains of their own. AMD, for example, is betting on a rack-scale architecture that it says could deliver a 20-fold improvement in energy efficiency by 2030, while newer chip designs support lower-precision computing, which can sharply reduce computational load. Nvidia's latest Blackwell GPUs, and the even newer Blackwell Ultra platform, promise further efficiency gains; Nvidia CEO Jensen Huang has said the company's GPUs can be 20 times more energy efficient than traditional CPUs for some AI workloads.

Yet newer chips come with a fundamental paradox. Dan Alistarh, a professor at the Institute of Science and Technology Austria who studies algorithmic efficiency, said that energy costs roughly doubled when upgrading to Nvidia's newest chips. “It's a weird trade-off, because you're running things faster, but you're using more energy, too,” Alistarh said.

The algorithms that power AI have shown slower efficiency progress. Researchers like Alistarh are investigating techniques to cut the energy consumption of generative AI, such as using simpler mathematics that requires less computing power. Other groups are exploring entirely different architectures that could replace transformers altogether.
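To make the “simpler mathematics” idea concrete, here is a generic sketch of int8 weight quantization, the kind of lower-precision trick the article alludes to (a textbook symmetric scheme, not a method attributed to Alistarh's group):

```python
# Storing model weights at lower precision shrinks memory traffic and the
# arithmetic cost per operation, at the price of a small approximation error.
import numpy as np

rng = np.random.default_rng(0)
weights_fp32 = rng.standard_normal(1_000_000).astype(np.float32)

# Symmetric quantization: map [-max|w|, +max|w|] onto int8's [-127, 127].
scale = np.abs(weights_fp32).max() / 127.0
weights_int8 = np.clip(np.round(weights_fp32 / scale), -127, 127).astype(np.int8)

print(f"fp32 size: {weights_fp32.nbytes / 1e6:.1f} MB")  # 4.0 MB
print(f"int8 size: {weights_int8.nbytes / 1e6:.1f} MB")  # 1.0 MB

# Dequantize to measure the error introduced by lower precision.
error = np.abs(weights_int8.astype(np.float32) * scale - weights_fp32).mean()
print(f"mean absolute error: {error:.4f}")
```

The fourfold memory saving is exactly the kind of efficiency gain at stake; the small approximation error it introduces is also why benchmark-sensitive labs hesitate to adopt such techniques.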

These innovations have struggled to gain traction, however, because AI companies are judged largely by how their models score on standardized benchmarks of reasoning, math, and language understanding, and those scores directly affect funding and market perception.

Companies would rather build energy-hungry models that score higher on those tests than efficient models that might lag behind competitors. The result is an industry optimized for leaderboard rankings rather than sustainability, where efficiency, whatever its cost savings, is at best a secondary concern.

“Anything that makes you rank lower in the benchmark rat race is a clear loss,” Alistarh said. “No one can afford that.”

Some of the reporting in this article was done as part of a press residency funded by the Institute of Science and Technology Austria.
