Nvidia details AI power supply, adds Innoscience GaN
Nvidia has detailed the power supply design it is using for the Grace Blackwell GB300-NVL72 AI system as it adds Chinese power chip maker Innoscience to its roster of suppliers.
Adding Innoscience is controversial, as it is in a legal dispute with another supplier, Infineon Technologies.
Related: Infineon, Nvidia team on datacentre AI power delivery
The power supply unit (PSU) for the GB300-NVL72 adds energy storage to smooth power spikes from AI workloads and reduce peak grid demand by up to 30%. This is key to avoiding putting pressure on the power grid, and the design will also be used for GB200 NVL72 systems.
Synchronized workloads
In AI training, thousands of GPUs operate in lockstep and perform the same computation on different data. This synchronization results in power fluctuations at the grid level. Unlike traditional data centre workloads, where uncorrelated tasks smooth out the load, AI workloads cause abrupt transitions between idle and high-power states. The GPUs operate synchronously, causing the total power drawn by a GPU cluster to mirror and amplify the power pattern of a single node.
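The difference between correlated and uncorrelated loads can be illustrated with a small simulation. This is only a sketch: the square-wave load shape, the idle and busy power figures and the function names are assumptions made for illustration, not values from Nvidia.

```python
import random

IDLE_W = 200.0   # hypothetical idle power per node, in watts
BUSY_W = 1000.0  # hypothetical busy power per node, in watts
PERIOD = 20      # hypothetical length of one compute/idle cycle, in time steps


def node_power(step, phase):
    """Square-wave load: busy for the first half of each cycle, idle for the rest."""
    return BUSY_W if (step + phase) % PERIOD < PERIOD // 2 else IDLE_W


def cluster_swing(phases, steps=200):
    """Return max minus min of the total cluster power over the simulated window."""
    totals = [sum(node_power(t, p) for p in phases) for t in range(steps)]
    return max(totals) - min(totals)


def compare(num_nodes=1000, seed=0):
    """Return (synchronized swing, uncorrelated swing) for the same number of nodes."""
    rng = random.Random(seed)
    synchronized = cluster_swing([0] * num_nodes)
    uncorrelated = cluster_swing([rng.randrange(PERIOD) for _ in range(num_nodes)])
    return synchronized, uncorrelated
```

With every node in phase, the cluster swing is the single-node swing multiplied by the node count. With random phases, most of the swing cancels out, which is the smoothing effect that traditional data centre workloads benefit from.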
To address this, the GB300 PSU uses several mechanisms across different operational phases, combining power cap, energy storage, and GPU burn mechanisms.
The power cap limits the GPU power draw at the start of a workload. Maximum power levels are sent to the GPUs by the power controller and are gradually increased, aligning with the ramp rates the grid can tolerate. A more complex strategy is used for ramp-down; if the workload ends abruptly, the GPU burn system continues to dissipate power by operating the GPU in a special power burner mode. This ensures a smooth transition rather than a sharp drop.

Figure 3. Power smoothing solution
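The ramp-up side of this can be sketched as a rate-limited cap. The function names, the step-based timing and the wattages used are hypothetical; the article does not describe how the power controller actually computes or communicates its limits.

```python
def ramp_up_caps(start_w, target_w, max_ramp_w_per_step):
    """Return successive power caps rising from start_w to target_w.

    The cap never rises by more than max_ramp_w_per_step in one step,
    which stands in for the ramp rate the grid can tolerate.
    """
    if max_ramp_w_per_step <= 0:
        raise ValueError("max_ramp_w_per_step must be positive")
    cap = start_w
    caps = [cap]
    while cap < target_w:
        cap = min(cap + max_ramp_w_per_step, target_w)
        caps.append(cap)
    return caps


def capped_draw(demand_w, caps):
    """Power actually drawn at each step: the workload demand clipped to the cap."""
    return [min(demand, cap) for demand, cap in zip(demand_w, caps)]
```

A workload that immediately demands full power is then held to the rising cap, so the draw seen upstream climbs in steps rather than jumping.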
For rapid, short-term power fluctuations during steady-state operation, electrolytic capacitors have been integrated into the GB300 NVL72 power shelves. The energy storage charges during times of low GPU power demand and discharges during times of high GPU power demand. However, electrolytic capacitors are notoriously unreliable and can fail in high-temperature environments.
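The charge and discharge behaviour can be modelled roughly as follows. This is a toy model under stated assumptions: the grid draw is made to follow an exponential moving average of the load, and the storage covers the difference until it is full or empty. The smoothing factor, time step and function name are hypothetical, and the real charge management controller may work quite differently.

```python
def smooth_with_storage(load_w, capacity_j, dt_s=0.01, alpha=0.1):
    """Return (grid_w, stored_j) traces for a given load trace.

    Assumptions: the target grid draw is an exponential moving average of
    the load, and the storage starts half charged.
    """
    stored = capacity_j / 2
    avg = load_w[0]
    grid, levels = [], []
    for load in load_w:
        avg += alpha * (load - avg)
        # Energy the storage would need to deliver this step
        # (positive = discharge, negative = charge).
        wanted_j = (load - avg) * dt_s
        # Clamp to what the storage can actually supply or absorb.
        delivered_j = max(-(capacity_j - stored), min(stored, wanted_j))
        stored -= delivered_j
        # The grid supplies whatever the storage does not.
        grid.append(load - delivered_j / dt_s)
        levels.append(stored)
    return grid, levels
```

When the storage neither fills nor empties, the grid sees only the slow moving average. Once it hits either limit, the remaining fluctuation passes straight through to the grid, which is why the amount of energy that fits in the power shelf matters.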
For ramp-down, a software algorithm senses that the GPU power has fallen to idle levels when the running average power drops. The software driver that implements the power smoothing algorithm then engages the hardware power burner. The burner keeps using constant power as it waits for the workload to resume; if the workload doesn’t resume, the burner smoothly reduces the power consumption. If the GPU workload does resume, the burner disengages instantly. When a workload ends, the burner tapers off the power draw at a rate consistent with grid capabilities and then disengages.
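The hold, taper and disengage sequence can be sketched as a small state machine. This is a simplification with hypothetical names and parameters: it triggers on instantaneous power falling below a floor rather than on a running average, and it counts time in abstract steps.

```python
def burner_power(gpu_w, floor_w, hold_steps, taper_w_per_step):
    """Return the extra power the burner dissipates at each step.

    While GPU power is below floor_w, the burner tops the draw up to a held
    level. The held level stays at floor_w for hold_steps, then falls by
    taper_w_per_step per step. The burner switches off as soon as the
    workload resumes (GPU power back at or above floor_w).
    """
    burn = []
    idle_steps = 0
    held_w = floor_w
    for power in gpu_w:
        if power >= floor_w:
            # Workload is running: disengage immediately and reset.
            idle_steps = 0
            held_w = floor_w
            burn.append(0.0)
            continue
        idle_steps += 1
        if idle_steps > hold_steps:
            # Workload has not resumed: taper towards the real GPU power.
            held_w = max(held_w - taper_w_per_step, power)
        burn.append(max(held_w - power, 0.0))
    return burn
```

Adding the burner output to the GPU power gives the total draw, which in this sketch never falls faster than the taper rate even when the workload stops abruptly.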

Table 1. Key configuration parameters that impact power demands
Measured benefits and results
Measurements taken on a power shelf in a GB200 rack, comparing the previous-generation GB200 power supply units with the GB300 units with energy storage, show significant improvements.
With the previous power supply, the AC power drawn from the grid tracks the fluctuations in rack power consumption. With the new design, the peak power demand seen by the grid was reduced by 30% when training the Megatron LLM, and rapid fluctuations in input power are substantially dampened.

The LITEON power supply unit for the GB300. Courtesy: Nvidia
Inside the GB300 power supply, about half of the volume is occupied by the capacitors for energy storage. Nvidia worked with power supply vendor LITEON Technology to optimize the power electronics for size, and filled the remaining space with 65 joules/GPU of energy storage. A new charge management controller provided rack-level fast transient power smoothing.
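To put the 65 joules/GPU figure in context, a back-of-the-envelope calculation helps. The rack total below assumes that all 72 GPUs in an NVL72 rack get the full per-GPU share, and the power swing passed to the second function is a hypothetical input, not a measured value.

```python
JOULES_PER_GPU = 65   # energy storage per GPU, from the article
GPUS_PER_RACK = 72    # the "72" in NVL72


def rack_storage_j():
    """Total stored energy per rack, assuming every GPU has its full share."""
    return JOULES_PER_GPU * GPUS_PER_RACK


def bridge_time_s(energy_j, delta_w):
    """How long energy_j joules can cover a power shortfall of delta_w watts."""
    if delta_w <= 0:
        raise ValueError("delta_w must be positive")
    return energy_j / delta_w
```

For example, `bridge_time_s(JOULES_PER_GPU, 650)` gives the time for which one GPU's share could cover a hypothetical 650 W swing. That is a fraction of a second, which is consistent with the storage being aimed at rapid, short-term fluctuations rather than sustained load changes.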
System design implications
Integrating energy storage not only smooths transients but also lowers the peak demand requirements for the rest of the data centre. This avoids the need for facilities to be provisioned for the maximum instantaneous power consumption.
The design ensures that the fluctuations within the rack are tolerated; the computing nodes and internal DC buses are built to accommodate rapid power state changes. The energy storage mechanism is only used to optimize the load profile seen by the grid and does not provide energy back to the utility.
Both the GB200 and GB300 NVL72 systems employ multiple power shelves within each rack. As a result, strategies for integrating energy storage and load smoothing must consider aggregation at the rack and data hall levels. Power reductions at peak enable either increased rack density or reduced provisioning requirements for the entire data centre.
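The trade-off between density and provisioning is simple arithmetic. In the sketch below, the feed capacity and per-rack peak are hypothetical inputs, and the 30% figure is the upper bound quoted in the article rather than a guaranteed reduction.

```python
import math


def racks_supported(feed_w, rack_peak_w, peak_reduction=0.0):
    """Number of whole racks a feed can carry at their (reduced) peak power.

    peak_reduction is the fraction by which smoothing lowers the peak,
    for example 0.3 for the "up to 30%" case.
    """
    if not 0.0 <= peak_reduction < 1.0:
        raise ValueError("peak_reduction must be in [0, 1)")
    effective_peak_w = rack_peak_w * (1.0 - peak_reduction)
    return math.floor(feed_w / effective_peak_w)
```

With a hypothetical 1 MW feed and 100 kW rack peak, `racks_supported(1_000_000, 100_000)` and `racks_supported(1_000_000, 100_000, 0.3)` compare the rack count with and without the full reduction. Read the other way, the same number of racks could be served by a smaller feed.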
www.nvidia.com; www.infineon.com; www.innoscience.com
