Microsoft backs Nvidia for AI supercomputer in the cloud
Microsoft is to use Nvidia's latest A100 GPU for its AI supercomputer in the cloud, opening the technology up to a wide range of applications.
Microsoft announced it would host an AI supercomputer in the cloud with OpenAI back in May, but did not detail the technology it would use. The virtual machine (VM) developed by Microsoft for AI combines eight Nvidia Ampere A100 GPUs with an AMD processor.
This architecture, called the ND A100 v4 VM series, can scale up to thousands of GPUs with 1.6 Tbit/s of interconnect bandwidth per VM: each of the eight GPUs is provided with its own dedicated, topology-agnostic 200 Gbit/s Nvidia Mellanox HDR InfiniBand connection.
These GPU sub-systems will be coupled with AMD's 'Rome' processors. These use a hybrid multi-die architecture that decouples compute from I/O: eight dies of processor cores, which map directly to the GPUs, and one I/O die that handles security and communication outside the processor. The latest 64-core/128-thread version, the EPYC 7H12, is built on TSMC's 7nm process with a 14nm I/O die in the package. It is designed for liquid-cooled data centre operation, with a 2.6GHz base frequency, power consumption of up to 280W and peak performance of up to 4.2TFLOPS.
All this is needed to handle large machine learning models. "The advantage of large scale models is that they only need to be trained once with massive amounts of data using AI supercomputing, enabling them to then be "fine-tuned" for different tasks and domains with much smaller datasets and resources," said Ian Finder, senior program manager for accelerated HPC infrastructure at Microsoft.
“The more parameters that a model has, the better it can capture the difficult nuances of the data, as demonstrated by our 17-billion-parameter Turing Natural Language Generation (T-NLG) model and its ability to understand language to answer questions from or summarize documents seen for the first time.”
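The "train once, then fine-tune" pattern Finder describes can be illustrated with a toy model: fit all parameters once on a large dataset, then adapt only part of the model on a much smaller, task-specific one. The pure-Python sketch below (synthetic data, a 1-D linear model, and helper names invented for illustration; nothing here is Microsoft's or OpenAI's actual pipeline) freezes the slope from "pre-training" and fine-tunes only the bias:

```python
import random

random.seed(0)

def make_data(n, slope, intercept, noise=0.1):
    """Synthetic 1-D regression data: y = slope*x + intercept + noise."""
    xs = [random.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [slope * x + intercept + random.gauss(0.0, noise) for x in xs]
    return xs, ys

def train(xs, ys, w, b, lr=0.1, epochs=200, freeze_w=False):
    """Full-batch gradient descent on mean squared error."""
    n = len(xs)
    for _ in range(epochs):
        gw = sum((w * x + b - y) * x for x, y in zip(xs, ys)) / n
        gb = sum((w * x + b - y) for x, y in zip(xs, ys)) / n
        if not freeze_w:
            w -= lr * gw
        b -= lr * gb
    return w, b

# "Pre-train" once on a large dataset (true slope 2.0, intercept 0.0).
xs_big, ys_big = make_data(10_000, slope=2.0, intercept=0.0)
w, b = train(xs_big, ys_big, w=0.0, b=0.0)

# "Fine-tune" on a small task-specific dataset: same underlying slope,
# shifted intercept. Only the bias is updated; the slope stays frozen.
xs_small, ys_small = make_data(20, slope=2.0, intercept=1.0)
w_ft, b_ft = train(xs_small, ys_small, w, b, freeze_w=True)
```

The economics are the point: the expensive step (10,000 samples, all parameters) runs once on the supercomputer, while the cheap step (20 samples, one parameter) can be repeated per task on modest hardware.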
Training models at this scale requires large clusters of hundreds of machines with specialized AI accelerators interconnected by high-bandwidth networks inside and across the machines.
“We have been building such clusters in Azure to enable new natural language generation and understanding capabilities across Microsoft products, and to power OpenAI on their mission to build safe artificial general intelligence,” said Finder.
This builds on Microsoft's previous public cloud offering of VM clusters with Nvidia's V100 Tensor Core GPUs, connected by a Mellanox InfiniBand network. "Most customers will see an immediate boost of 2x to 3x compute performance over the previous generation of systems based on NVIDIA V100 GPUs with no engineering work," said Finder, with certain applications seeing a 20x speed-up.
The ND A100 v4 VM series and clusters are now in preview and will become a standard offering in the Azure portfolio.
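For customers who want to try the preview, an ND A100 v4 machine is provisioned like any other Azure VM size. A hedged sketch using the Azure CLI (Standard_ND96asr_v4 is the size name Azure documents for the eight-GPU ND A100 v4 series; the resource group, VM name, region and image are placeholder assumptions, not values from this article):

```shell
# Create a resource group, then an ND A100 v4 VM (8x A100).
# Size name per Azure docs; all other names below are placeholders.
az group create --name my-hpc-rg --location eastus

az vm create \
  --resource-group my-hpc-rg \
  --name nd-a100-node \
  --size Standard_ND96asr_v4 \
  --image microsoft-dsvm:ubuntu-hpc:1804:latest \
  --admin-username azureuser \
  --generate-ssh-keys
```

Multi-node training clusters would then be built by placing several such VMs behind the InfiniBand fabric described above.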
www.nvidia.com; www.microsoft.com; www.amd.com