Marking the maturing of supercomputer technology, Nvidia has launched a commercial large-memory GPU system for an exascale supercomputer to train large AI models.
The DGX GH200 supercomputer combines 256 GH200 Grace Hopper hybrid processors through Nvidia’s NVLink Switch System with 144TBytes of shared memory. This is nearly 500x more memory than the previous-generation DGX A100, which was introduced in 2020.
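A quick sanity check on that memory comparison. The sketch below assumes the DGX A100 shipped with 8 x 40 GBytes = 320 GBytes of GPU memory; that baseline figure is our assumption and is not stated in the article.

```python
# Sanity check on the ~500x memory claim.
# Assumption: the 2020 DGX A100 had 8 x 40 GBytes = 320 GBytes of
# GPU memory (this baseline is not stated in the article).

DGX_A100_MEMORY_GB = 8 * 40
DGX_GH200_MEMORY_GB = 144 * 1024   # 144 TBytes, using 1 TByte = 1024 GBytes

ratio = DGX_GH200_MEMORY_GB / DGX_A100_MEMORY_GB
print(f"{ratio:.0f}x more memory")  # → 461x, i.e. nearly 500x
```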
The massive shared memory space uses NVLink interconnect technology with the new NVLink Switch System to combine the 256 GH200 superchips, allowing them to perform as a single GPU. The GH200 superchips eliminate the need for a traditional CPU-to-GPU PCIe connection by combining an ARM based Grace CPU with an H100 Tensor Core GPU in the same package.
Nvidia is building a supercomputer called Helios with four DGX GH200 systems, each connected with Nvidia’s Quantum-2 InfiniBand networking to supercharge data throughput for training large AI models. Helios will include 1,024 Grace Hopper Superchips and is expected to come online by the end of the year.
Google Cloud, Meta and Microsoft are among the first expected to gain access to the DGX GH200 to explore its capabilities for generative AI workloads. Nvidia also intends to provide the DGX GH200 design as a blueprint to cloud service providers and other hyperscalers so they can further customize it for their infrastructure.
“Generative AI, large language models and recommender systems are the digital engines of the modern economy,” said Jensen Huang, founder and CEO of Nvidia at the Computex show in Taipei, Taiwan, this morning. “DGX GH200 AI supercomputers integrate NVIDIA’s most advanced accelerated computing and networking technologies to expand the frontier of AI.”
The ‘superchip’ uses Nvidia’s NVLink-C2C chip interconnect to increase the bandwidth between the GPU and CPU by 7x compared with the latest PCIe technology and, just as importantly, to cut the interconnect power consumption by more than 5x.
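As a rough illustration of that 7x figure, the sketch below takes a PCIe Gen5 x16 link at roughly 128 GBytes/s per direction as the baseline; the choice of baseline is our assumption, not something stated in the announcement.

```python
# Rough arithmetic behind the claimed 7x CPU-to-GPU bandwidth gain.
# Assumption: the PCIe baseline is a Gen5 x16 link at ~128 GBytes/s
# per direction; NVLink-C2C is then 7x that figure.

PCIE_GEN5_X16_GBPS = 128          # GBytes/s, assumed baseline
NVLINK_C2C_SPEEDUP = 7            # factor quoted by Nvidia

nvlink_c2c_gbps = PCIE_GEN5_X16_GBPS * NVLINK_C2C_SPEEDUP
print(f"NVLink-C2C estimate: {nvlink_c2c_gbps} GBytes/s")  # → 896
```

The result lands at roughly 900 GBytes/s, which lines up with the per-GPU fabric bandwidth quoted later in the piece.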
The DGX GH200 is the first supercomputer to pair Grace Hopper Superchips with the NVLink Switch System. This is a new interconnect that enables all GPUs in a DGX GH200 system to work together rather than just eight in the previous generation system.
The NVLink Switch System forms a two-level, non-blocking, fat-tree NVLink fabric to fully connect the 256 Grace Hopper Superchips in a DGX GH200 system. Every GPU in the DGX GH200 can access the memory of the other GPUs and the extended memory of all the Grace CPUs at 900 GBytes/s.
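The shared-memory figure divides evenly across the fabric. A back-of-envelope sketch (using binary units, 1 TByte = 1024 GBytes, which is our assumption about how the 144TBytes figure is counted):

```python
# Back-of-envelope: how 144 TBytes of fabric-wide shared memory
# divides across the 256 Grace Hopper superchips.
# Assumption: 1 TByte = 1024 GBytes.

TOTAL_MEMORY_TB = 144
NUM_SUPERCHIPS = 256

per_superchip_gb = TOTAL_MEMORY_TB * 1024 / NUM_SUPERCHIPS
print(f"{per_superchip_gb:.0f} GBytes per superchip")  # → 576
```

That works out to 576 GBytes per superchip, spanning both the CPU-attached memory and the GPU memory; the article does not break down the split between the two.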
Compute baseboards hosting Grace Hopper Superchips are connected to the NVLink Switch System using a custom cable harness for the first layer of NVLink fabric. LinkX cables extend the connectivity in the second layer of NVLink fabric.
“Building advanced generative models requires innovative approaches to AI infrastructure,” said Mark Lohmeyer, vice president of Compute at Google Cloud. “The new NVLink scale and shared memory of Grace Hopper Superchips address key bottlenecks in large-scale AI and we look forward to exploring its capabilities for Google Cloud and our generative AI initiatives.”
“As AI models grow larger, they need powerful infrastructure that can scale to meet increasing demands,” said Alexis Björlin, vice president of Infrastructure, AI Systems and Accelerated Platforms at Meta. “Nvidia’s Grace Hopper design looks to provide researchers with the ability to explore new approaches to solve their greatest challenges.”
“Training large AI models is traditionally a resource- and time-intensive task,” said Girish Bablani, corporate vice president of Azure Infrastructure at Microsoft. “The potential for DGX GH200 to work with terabyte-sized datasets would allow developers to conduct advanced research at a larger scale and accelerated speeds.”
DGX GH200 supercomputers include NVIDIA software to provide a turnkey, full-stack solution for the largest AI and data analytics workloads. Nvidia’s Base Command software provides AI workflow management, enterprise-grade cluster management, libraries that accelerate compute, storage and network infrastructure, and system software optimized for running AI workloads.
Also included is Nvidia AI Enterprise as the software layer. This provides over 100 frameworks, pretrained models and development tools to streamline the development and deployment of production AI, including generative AI, computer vision and speech.