Meta builds AI research ‘supercluster’ supercomputer

By Rich Pell

Meta's AI Research SuperCluster (RSC), says the company, is among the fastest AI supercomputers running today and will be the fastest AI supercomputer in the world when it’s fully built out in mid-2022. According to the company, its researchers have already started using RSC to train large models in natural language processing (NLP) and computer vision for research, with the aim of one day training models with trillions of parameters.

The company expects that RSC will help its AI researchers build new and better AI models that can learn from trillions of examples; work across hundreds of different languages; seamlessly analyze text, images, and video together; develop new augmented reality tools; and much more.

“Our researchers will be able to train the largest models needed to develop advanced AI for computer vision, NLP, speech recognition, and more,” says the company. “We hope RSC will help us build entirely new AI systems that can, for example, power real-time voice translations to large groups of people, each speaking a different language, so they can seamlessly collaborate on a research project or play an AR game together. Ultimately, the work done with RSC will pave the way toward building technologies for the next major computing platform, the metaverse, where AI-driven applications and products will play an important role.”

To fully realize the benefits of self-supervised learning and transformer-based models, domains such as vision, speech, and language, along with critical use cases like identifying harmful content, will require training increasingly large, complex, and adaptable models. Computer vision, for example, needs to process larger, longer videos with higher data sampling rates. Speech recognition needs to work well even in challenging scenarios with lots of background noise, such as parties or concerts. NLP needs to understand more languages, dialects, and accents. And advances in other areas, including robotics, embodied AI, and multimodal AI, will help people accomplish useful tasks in the real world, says the company.

Until now, the company’s researchers have been using an HPC infrastructure designed in 2017, based on 22,000 NVIDIA V100 Tensor Core GPUs in a single cluster that performs 35,000 training jobs a day.

“In early 2020,” says the company, “we decided the best way to accelerate progress was to design a new computing infrastructure from a clean slate to take advantage of new GPU and network fabric technology. We wanted this infrastructure to be able to train models with more than a trillion parameters on data sets as large as an exabyte — which, to provide a sense of scale, is the equivalent of 36,000 years of high-quality video.”
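As a rough sanity check on that comparison, the average video bitrate it implies can be worked out directly. The sketch below assumes a decimal exabyte (10^18 bytes) and a 365.25-day year; neither assumption is stated by the company, and the function name is purely illustrative.

```python
# Assumptions (not from the source): decimal exabyte and a 365.25-day year.
EXABYTE_BYTES = 10**18
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def implied_bitrate_mbps(total_bytes: float, years_of_video: float) -> float:
    """Average bitrate, in megabits per second, if total_bytes holds that much video."""
    seconds = years_of_video * SECONDS_PER_YEAR
    return total_bytes * 8 / seconds / 1e6


if __name__ == "__main__":
    rate = implied_bitrate_mbps(EXABYTE_BYTES, 36_000)
    print(f"Implied average bitrate: {rate:.1f} Mbit/s")
```

Under those assumptions the result is roughly 7 Mbit/s, which is in the range commonly used for high-definition video streams.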

“While the high-performance computing community has been tackling scale for decades,” says the company, “we also had to make sure we have all the needed security and privacy controls in place to protect any training data we use. Unlike with our previous AI research infrastructure, which leveraged only open source and other publicly available data sets, RSC also helps us ensure that our research translates effectively into practice by allowing us to include real-world examples from Meta’s production systems in model training. By doing this, we can help advance research to perform downstream tasks such as identifying harmful content on our platforms as well as research into embodied AI and multimodal AI to help improve user experiences on our family of apps.”

The company says it believes this is the first time performance, reliability, security, and privacy have been tackled at such a scale. RSC today comprises 760 NVIDIA DGX A100 systems as its compute nodes, for a total of 6,080 GPUs, with each A100 GPU being more powerful than the V100 used in the company's previous system. Each DGX communicates via an NVIDIA Quantum 1600 Gb/s InfiniBand two-level Clos fabric that has no oversubscription. RSC’s storage tier has 175 petabytes of Pure Storage FlashArray, 46 petabytes of cache storage in Penguin Computing Altus systems, and 10 petabytes of Pure Storage FlashBlade.
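The GPU count follows from the node count, since each DGX A100 system houses eight A100 GPUs. A minimal sketch of that arithmetic, together with the combined capacity of the three storage tiers listed above, might look like this (the variable and function names are illustrative, and the summed storage figure is derived here rather than quoted by the company):

```python
GPUS_PER_DGX_A100 = 8  # each DGX A100 system contains eight A100 GPUs

# Storage tiers as listed in the article, in petabytes.
STORAGE_TIERS_PB = {
    "Pure Storage FlashArray": 175,
    "Penguin Computing Altus cache": 46,
    "Pure Storage FlashBlade": 10,
}


def total_gpus(nodes: int, gpus_per_node: int = GPUS_PER_DGX_A100) -> int:
    """Number of GPUs in a cluster of identical nodes."""
    return nodes * gpus_per_node


if __name__ == "__main__":
    print("GPUs:", total_gpus(760))
    print("Storage (PB):", sum(STORAGE_TIERS_PB.values()))
```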

Early benchmarks on RSC, compared with the company’s legacy production and research infrastructure, have shown that it runs computer vision workflows up to 20 times faster, runs the NVIDIA Collective Communication Library (NCCL) more than nine times faster, and trains large-scale NLP models three times faster. That means a model with tens of billions of parameters can finish training in three weeks, compared with nine weeks before.

When RSC is complete, the InfiniBand network fabric will connect 16,000 GPUs as endpoints, making it one of the largest such networks deployed to date. Additionally, the company designed a caching and storage system that can serve 16 TB/s of training data, and it plans to scale the system up to 1 exabyte.
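To put the 16 TB/s figure in context, one can estimate how long a single full pass over an exabyte of data would take at that rate. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 EB = 10^18 bytes) and perfectly sustained bandwidth, which a real system would not achieve; it is an idealized lower bound, not a measured result.

```python
TERABYTE = 10**12  # decimal units are an assumption, not stated in the source
EXABYTE = 10**18


def full_pass_hours(capacity_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Hours needed to stream capacity_bytes once at a constant bandwidth."""
    return capacity_bytes / bandwidth_bytes_per_s / 3600


if __name__ == "__main__":
    hours = full_pass_hours(EXABYTE, 16 * TERABYTE)
    print(f"One pass over 1 EB at 16 TB/s: {hours:.1f} hours")
```

Under these assumptions, a single pass takes a little over 17 hours.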

Beyond the core system itself, there was also a need for a powerful storage solution, one that can deliver terabytes per second of bandwidth from an exabyte-scale storage system. To serve AI training’s growing bandwidth and capacity needs, the company developed a storage service, AI Research Store (AIRStore), from the ground up.

To optimize for AI models, AIRStore utilizes a new data preparation phase that preprocesses the data set to be used for training. Once the preparation is performed one time, the prepared data set can be used for multiple training runs until it expires. AIRStore also optimizes data transfers so that cross-region traffic on the company’s inter-datacenter backbone is minimized.
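The prepare-once, reuse-until-expiry pattern described above can be illustrated with a small in-memory sketch. This is not AIRStore's actual interface, which the company has not detailed here; the class name `PreparedDatasetCache`, its methods, and the time-to-live policy are all hypothetical and chosen only to show the idea.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PreparedEntry:
    data: List[int]
    expires_at: float


class PreparedDatasetCache:
    """Hypothetical illustration only; not AIRStore's real API."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, PreparedEntry] = {}
        self.prepare_calls = 0  # counts how often the costly step actually ran

    def get(self, name: str, prepare: Callable[[], List[int]]) -> List[int]:
        """Return the prepared data set, running prepare() only if missing or expired."""
        now = self._clock()
        entry = self._entries.get(name)
        if entry is None or now >= entry.expires_at:
            self.prepare_calls += 1
            entry = PreparedEntry(prepare(), now + self._ttl)
            self._entries[name] = entry
        return entry.data


if __name__ == "__main__":
    cache = PreparedDatasetCache(ttl_seconds=3600)
    preprocess = lambda: [x * 2 for x in range(5)]  # stand-in for real preprocessing
    for run in range(3):  # three training runs share one preparation
        cache.get("example-dataset", preprocess)
    print("Preparation ran", cache.prepare_calls, "time(s)")
```

The injected `clock` parameter is a design convenience for testing expiry without waiting; a production service would track expiry in persistent metadata rather than in process memory.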

While RSC is up and running today, its development is ongoing, says the company.

“Once we complete phase two of building out RSC, we believe it will be the fastest AI supercomputer in the world, performing at nearly 5 exaflops of mixed precision compute,” says the company. “Through 2022, we’ll work to increase the number of GPUs from 6,080 to 16,000, which will increase AI training performance by more than 2.5x. The InfiniBand fabric will expand to support 16,000 ports in a two-layer topology with no oversubscription. The storage system will have a target delivery bandwidth of 16 TB/s and exabyte-scale capacity to meet increased demand.”
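The figures in that statement can be cross-checked with simple arithmetic. The sketch below computes the GPU scale-up factor and the per-GPU throughput implied by 5 exaflops spread evenly across 16,000 GPUs. Even distribution is an assumption made here for illustration, and the function names are not from the source.

```python
def scale_factor(current_gpus: int, target_gpus: int) -> float:
    """Ratio of the planned GPU count to the current one."""
    return target_gpus / current_gpus


def per_gpu_tflops(total_exaflops: float, gpus: int) -> float:
    """Implied throughput per GPU in teraflops, assuming an even split."""
    return total_exaflops * 1e6 / gpus  # 1 exaflop = 1e6 teraflops


if __name__ == "__main__":
    print(f"GPU scale-up: {scale_factor(6_080, 16_000):.2f}x")
    print(f"Implied per-GPU rate: {per_gpu_tflops(5, 16_000):.1f} TFLOPS")
```

The scale-up works out to about 2.63x, consistent with the "more than 2.5x" claim, and the implied per-GPU rate of about 312 TFLOPS lines up with the A100's published peak for dense FP16/BF16 tensor operations, which suggests the "nearly 5 exaflops" figure is a peak rather than a sustained number.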

“We expect such a step function change in compute capability to enable us not only to create more accurate AI models for our existing services, but also to enable completely new user experiences, especially in the metaverse. Our long-term investments in self-supervised learning and in building next-generation AI infrastructure with RSC are helping us create the foundational technologies that will power the metaverse and advance the broader AI community as well.”

Meta AI

Related articles:
Microsoft announces new supercomputer, large-scale AI vision
IBM, DOE unveil ‘world’s fastest’ AI supercomputer
Nvidia to build AI supercomputer to predict climate change
Testbed supercomputer sets stage for ‘exascale era’
Tesla unveils custom chip for AI-training supercomputer


