VMware backs Graphcore for massive data centre rollouts

Business news
By Nick Flaherty

Leading UK AI chip startup Graphcore has signed a key deal with VMware to support its AI hardware in data centres.

VMware’s Project Radium will support systems from Graphcore as part of its hardware disaggregation initiative. This will enable pooling and sharing of resources over the primary data centre network in virtualized, multi-tenant environments without adding complexity for the user or the management software.

Graphcore in Bristol has raised $682m in funding from investors such as Dell, BMW i Ventures and Robert Bosch Venture Capital for its Intelligence Processing Unit (IPU) AI technology based around a chip called Colossus.

The network-disaggregated architecture of Graphcore’s IPU-POD racks, coupled with flexible resource management features in Project Radium, will allow users to train very large models at scale and deploy them in reliable production environments for AI-based services.

VMware is already the leading provider of enterprise virtualization software and tools in the cloud, and Project Radium is another key development. It enables remoting, pooling and sharing of resources on a wide range of different hardware architectures, including Graphcore IPUs and IPU-PODs.

Graphcore has also launched two larger IPU-PODs, the IPU-POD128 and IPU-POD256, with 32 and 64 petaFLOPS of AI compute respectively. These systems are aimed at cloud hyperscalers, national scientific computing labs and enterprises with large AI teams. They can be used for faster training of large Transformer-based language models across an entire system or for running large-scale commercial AI inference applications in production. A system can also be divided into smaller, flexible vPODs to give more developers IPU access.
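As a rough illustration of what those figures imply, the sketch below works out the quoted compute per IPU and the share each vPOD would get if a system were split evenly. It assumes the 32 and 64 petaFLOPS figures belong to the IPU-POD128 and IPU-POD256 and that the number in each product name is its IPU count. The helper names and the even-split rule are hypothetical and are not part of any Graphcore or VMware API.

```python
# Illustrative arithmetic only; not a Graphcore or VMware API.
# Assumption: the numeric suffix in "IPU-POD128" / "IPU-POD256" is the IPU count.
PODS = {
    "IPU-POD128": {"ipus": 128, "petaflops": 32},
    "IPU-POD256": {"ipus": 256, "petaflops": 64},
}


def per_ipu_teraflops(pod_name):
    """Quoted AI compute per IPU, in teraFLOPS (1 petaFLOPS = 1000 teraFLOPS)."""
    pod = PODS[pod_name]
    return pod["petaflops"] * 1000 / pod["ipus"]


def split_into_vpods(pod_name, vpod_size):
    """Divide a pod into equal vPODs (hypothetical helper).

    Returns (number of vPODs, petaFLOPS available to each vPOD).
    """
    pod = PODS[pod_name]
    if vpod_size <= 0 or pod["ipus"] % vpod_size != 0:
        raise ValueError("vPOD size must divide the pod's IPU count evenly")
    count = pod["ipus"] // vpod_size
    return count, pod["petaflops"] / count


if __name__ == "__main__":
    for name in PODS:
        print(name, per_ipu_teraflops(name), "teraFLOPS per IPU")
    print(split_into_vpods("IPU-POD256", 16))
```

Under these assumptions both systems work out to the same per-IPU figure, which is consistent with the larger system being a straightforward scale-out of the smaller one.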

Both IPU-POD128 and IPU-POD256 are shipping to customers today from Atos and other systems integrator partners and are available to buy in the cloud.

“We are enthusiastic to add IPU-POD128 and IPU-POD256 systems from Graphcore into our Atos ThinkAI portfolio to accelerate our customers’ capabilities to explore and deploy larger and more innovative AI models across many sectors, including academic research, finance, healthcare, telecoms and consumer internet,” said Agnès Boudot, Senior Vice President, Head of HPC & Quantum at Atos.

One of the first customers to deploy IPU-POD128 is Korea Telecom (KT). “KT is the first company in Korea to provide a ‘Hyperscale AI Service’ utilizing the Graphcore IPUs in a dedicated high-density AI zone within our IDC. Numerous companies and research institutes are currently either using the above service for research and PoCs or testing on the IPU,” the company said.

Project Radium delivers device virtualization and remoting capabilities across a multitude of high-performance AI accelerators without the need for explicit code changes or user intervention. Developers can concentrate fully on their models rather than on hardware-specific compilers, drivers or software optimizations.

By dynamically attaching to hardware like IPU-PODs over a standard network, users will be able to leverage high-performance architectures such as the IPU to accelerate more demanding use cases at scale. 

Graphcore’s IPU has a high degree of fine-grained parallelism at the hardware level, supports single- and half-precision floating-point arithmetic, and is well suited to sparse compute without taking any specific dependency on sparsity in the underlying data. The processor is suitable for both training and inference of deep neural networks, but the supporting software that can scale across thousands of processors is key.

Instead of adopting a conventional SIMD/SIMT architecture like GPUs, the IPU uses a MIMD architecture with ultra-high-bandwidth on-chip memory and low-latency, high-bandwidth interconnects for efficient intra- and inter-chip communications. This makes IPUs suitable for running parallel machine learning models at data centre scale on distributed hardware.
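The difference between the two execution models can be shown with a toy sketch. The plain Python below is conceptual only and is not IPU or GPU code. The function names are hypothetical and it says nothing about how either kind of chip is actually programmed.

```python
# Conceptual toy only: plain Python, not IPU or GPU code.

def simd_step(op, lanes):
    """SIMD/SIMT style: one instruction stream applied to every data lane."""
    return [op(x) for x in lanes]


def mimd_step(programs, lanes):
    """MIMD style: each processing unit runs its own program on its own data."""
    if len(programs) != len(lanes):
        raise ValueError("need one program per lane")
    return [program(x) for program, x in zip(programs, lanes)]


if __name__ == "__main__":
    data = [1, 2, 3, 4]
    # Every lane executes the same operation.
    print(simd_step(lambda x: x * 2, data))
    # Each lane executes a different operation.
    print(mimd_step([lambda x: x * 2, lambda x: x + 10, abs, lambda x: -x], data))
```

In the SIMD case all lanes must follow the same instruction, whereas in the MIMD case each unit is free to do different work, which is the property the article associates with fine-grained parallelism.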

Scaling out from one to thousands of IPUs is supported by the IPU-POD architecture. This scaling is independent of the CPUs and helps users meet workload-specific demands on compute resources. For example, machine learning models for natural language processing tasks are generally not CPU intensive, whereas computer vision tasks can be, because the CPU is used for tasks such as image pre-processing or augmentation.

Besides support for core machine learning software frameworks, integration with virtualization, orchestration and scheduling software is crucial for large-scale deployment.

Resource management components in Graphcore’s software stack facilitate easy integration with a variety of cloud provisioning and management stacks such as the one offered by VMware to support operations in the public cloud, hybrid cloud or on-premises infrastructure.
