
Meta details its first custom RISC-V AI silicon


Feature articles | By Nick Flaherty



Meta has revealed details of the first chip it has designed in-house using RISC-V to run AI frameworks for its Facebook and Instagram services.

The company found that GPUs were not always optimal for running Meta’s specific recommendation workloads at the levels of efficiency required for billions of users. So it designed a family of recommendation-specific Meta Training and Inference Accelerator (MTIA) ASICs with next-generation recommendation model requirements in mind, and integrated them with PyTorch.
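By way of illustration, the sketch below shows the general shape of a recommendation workload in standard PyTorch: sparse embedding-table lookups that stress memory bandwidth, feeding a small dense MLP that stresses matrix-multiply compute. The model, layer sizes and names here are illustrative only, not Meta’s production workload.

```python
# Toy recommendation-style model in standard PyTorch, illustrating the mix of
# sparse embedding lookups and dense MLP compute that accelerators like MTIA
# target. All sizes are made up for the example.
import torch
import torch.nn as nn

class ToyRecModel(nn.Module):
    def __init__(self, num_ids=100_000, emb_dim=64, dense_in=13):
        super().__init__()
        # Sparse part: embedding table lookups dominate memory bandwidth.
        self.emb = nn.EmbeddingBag(num_ids, emb_dim, mode="sum")
        # Dense part: small MLPs dominate compute (matrix multiplications).
        self.mlp = nn.Sequential(
            nn.Linear(dense_in + emb_dim, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid(),
        )

    def forward(self, dense_features, sparse_ids, offsets):
        pooled = self.emb(sparse_ids, offsets)
        return self.mlp(torch.cat([dense_features, pooled], dim=1))

model = ToyRecModel()
dense = torch.randn(4, 13)                  # batch of 4 requests
ids = torch.randint(0, 100_000, (12,))      # 3 sparse ids per request
offsets = torch.tensor([0, 3, 6, 9])
print(model(dense, ids, offsets).shape)     # torch.Size([4, 1])
```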

The first chip was designed in 2020 and is built in TSMC’s 7nm process. It runs at 800 MHz, providing 102.4 TOPS at INT8 precision and 51.2 TFLOPS at FP16 precision, with a thermal design power (TDP) of 25 W.
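A quick back-of-the-envelope check ties the headline figures together, assuming they are spread evenly over the 64-PE grid described later in the article (that per-PE split is a derived assumption, not a number from the article):

```python
# At 800 MHz, 102.4 TOPS of INT8 implies 128,000 INT8 operations per cycle
# across the chip, or 2,000 per PE for a 64-PE grid.
freq_hz = 800e6
int8_tops = 102.4e12
fp16_tflops = 51.2e12
num_pes = 64

ops_per_cycle_int8 = int8_tops / freq_hz    # chip-wide INT8 ops per cycle
ops_per_cycle_fp16 = fp16_tflops / freq_hz  # chip-wide FP16 ops per cycle
print(f"{ops_per_cycle_int8:.0f} INT8 ops/cycle, {ops_per_cycle_int8 / num_pes:.0f} per PE")
print(f"{ops_per_cycle_fp16:.0f} FP16 ops/cycle, {ops_per_cycle_fp16 / num_pes:.0f} per PE")
```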

Building custom silicon, especially for the first time, is a significant undertaking. This initial programme provided lessons for the Meta chip roadmap, including architectural insights and software stack enhancements that will lead to improved performance and scale of future systems.

These challenges are becoming increasingly complicated. Looking at historical scaling trends in the industry, memory and interconnect bandwidth have scaled at a much lower pace than compute over the last several generations of hardware platforms.

This lagging memory and interconnect bandwidth also shows up in the final performance of workloads; for example, a significant portion of a workload’s execution time can be spent on networking and communication.

So the design teams are currently focused on striking a balance between compute power, memory bandwidth, and interconnect bandwidth to achieve the best performance for Meta’s workloads.

At a high level, the accelerator consists of a grid of processing elements (PEs), on-chip and off-chip memory resources, and interconnects.

Each PE is equipped with two processor cores (one of them equipped with the vector extension) and a number of fixed-function units that are optimized for performing critical operations, such as matrix multiplication, accumulation, data movement, and nonlinear function calculation. The processor cores are based on the RISC-V open instruction set architecture (ISA) and are heavily customized to perform necessary compute and control tasks.
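As a rough data-structure sketch of that description (the class and field names are illustrative, not Meta’s internal naming):

```python
# Minimal sketch of the PE make-up described above: two RISC-V cores, one of
# which carries the vector extension, plus fixed-function units for the
# critical operations named in the article.
from dataclasses import dataclass

@dataclass
class RiscvCore:
    has_vector_extension: bool

@dataclass
class ProcessingElement:
    cores: tuple = (RiscvCore(False), RiscvCore(True))
    fixed_function_units: tuple = (
        "matrix_multiply", "accumulate", "data_movement", "nonlinear_function",
    )

pe = ProcessingElement()
print(len(pe.cores), pe.fixed_function_units)
```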

Each PE also has 128 KB of local SRAM memory for quickly storing and operating on data. The architecture maximizes parallelism and data reuse, which are foundational for running workloads efficiently.
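A rough illustration of why 128 KB of local SRAM supports data reuse: square matrix-multiplication tiles can be sized so that both input tiles and the accumulator tile stay resident in the PE. The tile shape and data types below are assumptions for the example, not figures from the article.

```python
# Largest square matmul tile whose two input tiles plus one output tile fit in
# 128 KB of PE-local SRAM.
sram_bytes = 128 * 1024

def max_square_tile(elem_bytes_in, elem_bytes_out):
    """Largest n such that two n*n input tiles plus one n*n output tile fit."""
    n = 1
    while (2 * n * n * elem_bytes_in) + (n * n * elem_bytes_out) <= sram_bytes:
        n += 1
    return n - 1

print(max_square_tile(1, 4))   # INT8 inputs, INT32 accumulator: n = 147
print(max_square_tile(2, 4))   # FP16 inputs, FP32 accumulator: n = 128
```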

The accelerator is equipped with a dedicated control subsystem that runs the system’s firmware. The firmware manages available compute and memory resources, communicates with the host through a dedicated host interface, and orchestrates job execution on the accelerator.
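Purely as an illustration of that orchestration role, the job lifecycle could be sketched as the states below; all names are hypothetical and not taken from Meta’s firmware.

```python
# Hypothetical job lifecycle covering the three firmware roles named above:
# host communication, resource management and job orchestration.
from enum import Enum, auto

class JobState(Enum):
    SUBMITTED_BY_HOST = auto()    # received over the dedicated host interface
    RESOURCES_ALLOCATED = auto()  # compute and memory reserved by the firmware
    RUNNING_ON_GRID = auto()      # dispatched to a grid or subgrid of PEs
    COMPLETED = auto()            # results reported back to the host

print([state.name for state in JobState])
```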

The memory subsystem uses LPDDR5 for the off-chip DRAM resources and can scale up to 128 GB.

The chip also has 128 MB of on-chip SRAM shared among all the PEs, which provides higher bandwidth and much lower latency for frequently accessed data and instructions.
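Summarising the resulting memory hierarchy (capacities are the article’s figures; latency and bandwidth are only characterised relatively, as the article gives no numbers for them):

```python
# Three-level memory hierarchy of the MTIA accelerator as described above.
memory_hierarchy = [
    {"level": "PE-local SRAM",   "capacity": "128 KB per PE (8 MB across 64 PEs)",
     "latency": "lowest",  "bandwidth": "highest"},
    {"level": "On-chip SRAM",    "capacity": "128 MB shared by all PEs",
     "latency": "low",     "bandwidth": "high"},
    {"level": "Off-chip LPDDR5", "capacity": "up to 128 GB",
     "latency": "highest", "bandwidth": "lowest"},
]
for tier in memory_hierarchy:
    print(f'{tier["level"]:>15}: {tier["capacity"]} ({tier["latency"]} latency)')
```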

The chip provides both thread and data level parallelism (TLP and DLP), exploits instruction level parallelism (ILP), and enables abundant amounts of memory-level parallelism (MLP) by allowing numerous memory requests to be outstanding concurrently.
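A rough Little’s-law style illustration of why memory-level parallelism matters: the bandwidth a PE can sustain from DRAM grows with the number of requests it keeps outstanding. The latency and request size below are assumed values for the example, not figures from the article.

```python
# Sustained bandwidth = outstanding requests * bytes per request / latency.
latency_s = 100e-9     # assumed DRAM access latency: 100 ns
request_bytes = 64     # assumed request size: one 64-byte line

for outstanding in (1, 8, 32, 128):
    sustained_gb_s = outstanding * request_bytes / latency_s / 1e9
    print(f"{outstanding:>4} outstanding requests -> {sustained_gb_s:6.2f} GB/s")
```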

The grid contains 64 PEs organized in an 8×8 configuration. The PEs are connected to one another and to the memory blocks via a mesh network. The grid can be utilized for running a job as a whole, or it can be divided into multiple subgrids that can run independent jobs.
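A minimal sketch of dividing the 8×8 grid into subgrids; the particular partitioning used here (four 4×4 quadrants) is just an example, as the article only says the grid can be split into multiple subgrids running independent jobs.

```python
# Partition the 8x8 grid of PEs into four 4x4 subgrids.
GRID = 8  # 8x8 = 64 PEs

def subgrid(row0, col0, rows, cols):
    """Return the (row, col) coordinates of the PEs in one rectangular subgrid."""
    return [(r, c) for r in range(row0, row0 + rows)
                   for c in range(col0, col0 + cols)]

quadrants = [subgrid(r, c, 4, 4) for r in (0, 4) for c in (0, 4)]
assert sum(len(q) for q in quadrants) == GRID * GRID  # all 64 PEs covered
print(len(quadrants), "subgrids of", len(quadrants[0]), "PEs each")
```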

The MTIA accelerators are mounted on small dual M.2 boards, which allows for easier aggregation into a server. These boards are connected to the host CPU on the server using PCIe Gen4 x8 links and consume as little as 35 W.
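Some back-of-the-envelope numbers for the board-level integration; the bandwidth and efficiency figures are derived here from the quoted specifications, not quoted in the article.

```python
# Raw PCIe Gen4 x8 bandwidth (16 GT/s per lane, 128b/130b encoding) and INT8
# efficiency at the quoted 25 W chip TDP and 35 W board power.
lanes, gts_per_lane, encoding = 8, 16e9, 128 / 130
pcie_gb_s = lanes * gts_per_lane * encoding / 8 / 1e9
print(f"PCIe Gen4 x8: ~{pcie_gb_s:.2f} GB/s per direction")   # ~15.75 GB/s

int8_tops = 102.4
print(f"{int8_tops / 25:.1f} TOPS/W at 25 W chip TDP")        # 4.1 TOPS/W
print(f"{int8_tops / 35:.1f} TOPS/W at 35 W board power")     # 2.9 TOPS/W
```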

The servers that host these accelerators use the Yosemite V3 server specification from the Open Compute Project. Each server contains 12 accelerators that are connected to the host CPU and to one another using a hierarchy of PCIe switches, so that communication between accelerators does not need to involve the host CPU.
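Scaling the per-accelerator figures to one server gives a rough picture of a fully populated host (derived numbers, not quoted in the article):

```python
# Aggregate figures for one 12-accelerator Yosemite V3 host, using the
# per-accelerator numbers from earlier in the article.
accelerators_per_server = 12
int8_tops, fp16_tflops, board_watts = 102.4, 51.2, 35

print(f"{accelerators_per_server * int8_tops:.1f} TOPS INT8 per server")       # 1228.8
print(f"{accelerators_per_server * fp16_tflops:.1f} TFLOPS FP16 per server")   # 614.4
print(f"{accelerators_per_server * board_watts} W accelerator power per server")  # 420
```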

This topology allows workloads to be distributed over multiple accelerators and run in parallel. The number of accelerators and the server configuration parameters are carefully chosen to be optimal for executing current and future workloads.

ai.facebook.com/blog/meta-training-inference-accelerator-AI-MTIA/

 
