
Meta launches second generation custom AI chip

Meta has launched the second generation of its Meta Training and Inference Accelerator (MTIA) custom AI chip.
The custom chip from Meta (formerly Facebook) is designed for memory-bound large language models (LLMs) built on transformer frameworks. It is part of a growing trend for data centre operators to develop their own chips for specific functions.
The chip architecture is focused on striking a balance between compute, memory bandwidth, and memory capacity for serving ranking and recommendation models. For inference, this means delivering relatively high utilization even when batch sizes are relatively low. The outsized SRAM, compared with typical GPUs, sustains high utilization when batch sizes are limited while still providing enough compute for larger amounts of potential concurrent work.
- Meta details its first custom RISC-V AI silicon
- Qualcomm to put Meta’s generative AI onto its chips
- RISC-V boom from edge AI says Facebook’s chief AI scientist
The 5nm accelerator consists of an 8×8 grid of processing elements (PEs) with 2.35bn transistors in a die that measures 25.6mm x 16.4mm, or 421mm², housed in a 50mm x 40mm package.
The PEs provide significantly increased dense compute performance (3.5x over MTIA v1) and sparse compute performance (a 7x improvement). This comes partly from architectural improvements associated with pipelining of sparse compute. It also comes from how the PE grid is fed: compared with the first generation 7nm chip, local PE storage has tripled to 384kB, on-chip SRAM has doubled to 256MB with 3.5x the bandwidth, and LPDDR5 capacity has doubled to 128GB.
An improved network-on-chip (NoC) architecture doubles the bandwidth and allows different PEs to coordinate at low latency. These and other new functions in the PEs are the key technologies that Meta says are vital to its long-term roadmap of scaling MTIA to a wider variety of more challenging workloads.
MTIA2 runs at 1.35GHz from a 0.85V supply with a 90W thermal envelope. This provides 708 TFLOPS (INT8) with a sparse AI model, or 354 TFLOPS for dense INT8.
The chip is housed in a rack-based system that holds up to 72 accelerators. This consists of three chassis, each containing 12 boards that house two accelerators. Clocking the chip at 1.35GHz (up from 800MHz) and running it at 90W rather than 25W provides denser capabilities with higher compute, memory bandwidth, and memory capacity.
Beyond this, the fabric between the accelerators, and between the host and the accelerators, has been upgraded with eight PCIe Gen5 links providing 32 GB/s of bandwidth. There is also the option to add a network interface card to scale out beyond the rack.
The MTIA stack is designed to fully integrate with PyTorch 2.0 and features such as TorchDynamo and TorchInductor. Frontend graph-level capture, analysis, transformation, and extraction mechanisms (such as TorchDynamo and torch.export) are agnostic to MTIA and are being reused. The lower-level compiler for MTIA takes the outputs from this frontend and produces highly efficient, device-specific code. This lower-level compiler itself consists of a few components responsible for generating executable code for models and kernels.
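As a rough sketch of how that split looks from the PyTorch side, the snippet below uses PyTorch's public custom-backend hook for torch.compile: TorchDynamo captures the graph in a hardware-agnostic way and hands it to a backend-specific compiler. The names mtia_backend and lower_to_mtia are hypothetical stand-ins, not Meta's actual internal APIs; only the capture mechanism shown is standard PyTorch.

```python
import torch
import torch.fx as fx

# Hypothetical lower-level compiler entry point; Meta's real MTIA compiler is not public.
def lower_to_mtia(gm: fx.GraphModule, example_inputs):
    # A real flow would emit device-specific executable code for the captured graph.
    # Here we simply fall back to eager execution of the captured FX graph.
    return gm.forward

# Custom torch.compile backend: TorchDynamo passes it the captured FX graph,
# i.e. the hardware-agnostic frontend output described above.
def mtia_backend(gm: fx.GraphModule, example_inputs):
    gm.graph.print_tabular()   # inspect the captured operators
    return lower_to_mtia(gm, example_inputs)

model = torch.nn.Sequential(torch.nn.Linear(64, 128), torch.nn.ReLU())
compiled = torch.compile(model, backend=mtia_backend)
out = compiled(torch.randn(8, 64))   # first call triggers capture and "compilation"
```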
Below this sits the runtime stack responsible for interfacing with the driver and firmware. The MTIA streaming interface abstraction provides the basic operations that both inference and (in the future) training software require to manage device memory, as well as to run operators and execute compiled graphs on the device. Finally, the runtime interacts with the driver, which sits in user space, a decision Meta says enables it to iterate faster on the driver and firmware within its production stack.
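Purely as an illustration of the kind of interface such a runtime layer exposes, a minimal sketch might look like the following. Every class and method name here is an assumption for readability; none of it comes from Meta's actual runtime.

```python
from abc import ABC, abstractmethod

class StreamingRuntime(ABC):
    """Hypothetical sketch of the operations described for the MTIA streaming
    interface abstraction; names and signatures are illustrative assumptions."""

    @abstractmethod
    def alloc(self, nbytes: int) -> int:
        """Reserve device memory and return a handle."""

    @abstractmethod
    def copy_to_device(self, handle: int, host_buffer: bytes) -> None:
        """Move input data from the host into device memory."""

    @abstractmethod
    def run_operator(self, op_name: str, arg_handles: list[int]) -> int:
        """Execute a single compiled operator on the device."""

    @abstractmethod
    def execute_graph(self, graph_handle: int, input_handles: list[int]) -> list[int]:
        """Run a whole compiled graph, as the inference stack would."""
```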
Meta has further optimized the software stack by creating the Triton-MTIA compiler backend to generate high-performance code for the MTIA hardware. Triton is an open source language and compiler for writing highly efficient ML compute kernels. It improves developer productivity for writing GPU code, and Meta has found the Triton language sufficiently hardware-agnostic to be applicable to non-GPU architectures such as MTIA.
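Triton kernels themselves contain no device-specific code, which is what makes such a backend swap possible. The kernel below is the standard vector-add example from the open source Triton project; under the approach described here, the same source would simply be compiled by the Triton-MTIA backend instead of a GPU backend.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```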
The Triton-MTIA backend performs optimizations to maximize hardware utilization and support high-performance kernels. It also exposes key knobs that leverage the Triton and MTIA auto-tuning infrastructures to explore the kernel configuration and optimization space.
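In open source Triton, such knobs are exposed through the autotune decorator, which benchmarks a set of candidate configurations for each new input shape and caches the fastest one. The configuration values below are illustrative, not MTIA-specific tunings.

```python
import triton
import triton.language as tl

# Candidate configurations for the add kernel above; the autotuner times each
# one the first time a new n_elements value is seen and reuses the winner.
@triton.autotune(
    configs=[
        triton.Config({"BLOCK_SIZE": 256}, num_warps=2),
        triton.Config({"BLOCK_SIZE": 1024}, num_warps=4),
        triton.Config({"BLOCK_SIZE": 4096}, num_warps=8),
    ],
    key=["n_elements"],
)
@triton.jit
def add_kernel_tuned(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)
```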
Support for the Triton language and integration into PyTorch 2 provide extensive coverage for PyTorch operators. Thanks to TorchInductor, for example, developers can use Triton-MTIA in both ahead-of-time (AOT) and just-in-time (JIT) workflows.
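From the PyTorch side the two workflows look roughly as follows: torch.compile gives the just-in-time path, while torch.export captures a graph up front for an ahead-of-time flow. The downstream MTIA compilation step is not shown and would be backend-specific; the calls used here are standard PyTorch 2 APIs.

```python
import torch

model = torch.nn.Linear(32, 8)
example = torch.randn(4, 32)

# JIT path: TorchDynamo/TorchInductor compile the model on first call.
jit_model = torch.compile(model)
jit_out = jit_model(example)

# AOT path: torch.export captures the graph ahead of time, which a backend
# such as Triton-MTIA could then compile offline (hypothetical downstream step).
exported = torch.export.export(model, (example,))
print(exported.graph_module.graph)
```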
Early results show that this next generation silicon has already improved performance by 3x over the first generation chip across four key models. With twice the number of devices, plus a dual-socket CPU, the system provides a 1.5x improvement in performance per watt over the first generation MTIA system.
