Microsoft has built one of the world’s most powerful supercomputers dedicated to machine learning, but its announcement leaves out a key metric for AI: power consumption.
The system has been built for OpenAI and is hosted in Azure, with more than 285,000 CPU cores, 10,000 GPUs and 400 gigabits per second of network connectivity for each GPU server. This puts it among the top five supercomputers in the world, says Microsoft.
But unlike the operators of those discrete supercomputers, which report power figures, Microsoft left power consumption out of its recent announcement. This matters, as Microsoft says the system is a first step toward making the next generation of very large AI models, and the infrastructure needed to train them, available as a platform.
The AI research community has developed a new class of massive, self-learning models that can handle a range of tasks more efficiently in terms of computing power. These models have grown from 1bn parameters last year to more than 17bn.
“The exciting thing about these models is the breadth of things they’re going to enable,” said Microsoft Chief Technology Officer Kevin Scott, adding that the potential benefits extend far beyond narrow advances in one type of AI model.
“This is about being able to do a hundred exciting things in natural language processing at once and a hundred exciting things in computer vision, and when you start to see combinations of these perceptual domains, you’re going to have new applications that are hard to even imagine right now,” he said.
Metrics for power consumption and efficiency matter here, yet Microsoft hasn’t detailed the cores or the interconnect being used. The company has been highly aware of the impact of power consumption on its data centres. It has a commitment to reduce its carbon footprint to zero by 2030 and compensate for its historical impact on the climate by 2050.
At the same time OpenAI, which is backed by Microsoft with a $1bn investment, has shown that the number of floating-point operations required to train a classifier to AlexNet-level performance on ImageNet decreased by a factor of 44 between 2012 and 2019. This corresponds to algorithmic efficiency doubling every 16 months over a period of 7 years. By contrast, Moore’s Law would only have yielded an 11x cost improvement. The hardware and algorithmic efficiency gains multiply together, which suggests that a good model of AI progress should integrate measures of both.
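The arithmetic behind these figures can be checked with a short Python sketch. It is illustrative only: the function names are made up for this example, and the 24-month doubling period used for Moore’s Law is an assumption, chosen because it reproduces the roughly 11x figure quoted above.

```python
import math


def doubling_time_months(gain: float, years: float) -> float:
    """Months per doubling implied by a total `gain` achieved over `years`."""
    return years * 12 / math.log2(gain)


def gain_from_doubling(years: float, doubling_months: float) -> float:
    """Total gain after `years` if efficiency doubles every `doubling_months`."""
    return 2 ** (years * 12 / doubling_months)


YEARS = 7                    # 2012 to 2019, as stated in the article
ALGORITHMIC_GAIN = 44        # reduction in training FLOPs reported by OpenAI
MOORE_DOUBLING_MONTHS = 24   # assumption: hardware doubles every two years

algo_doubling = doubling_time_months(ALGORITHMIC_GAIN, YEARS)
hardware_gain = gain_from_doubling(YEARS, MOORE_DOUBLING_MONTHS)
# Independent gains multiply rather than add.
combined_gain = ALGORITHMIC_GAIN * hardware_gain

print(f"Algorithmic doubling time: {algo_doubling:.1f} months")
print(f"Hardware gain over {YEARS} years: {hardware_gain:.1f}x")
print(f"Combined gain: {combined_gain:.0f}x")
```

Under these assumptions the algorithmic doubling time comes out slightly under the 16 months quoted, the hardware gain lands just above 11x, and the combined gain is close to 500x. None of this says anything about energy use, which is the figure missing from the announcement.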
The key is what this means for the large single models running on large hardware arrays. While AI frameworks look at metrics such as cost and GPU time, they don’t look at power efficiency.
OpenAI plans to start tracking efficiency figures publicly, beginning with vision and translation efficiency benchmarks. It is looking at adding more benchmarks over time and is encouraging the research community to submit suitable metrics.